Preface


SEMANTiCS 2019 took place on September 9-12, 2019, in Karlsruhe, Germany.
SEMANTiCS offers a forum for the exchange of the latest scientific results in semantic
systems and complements these topics with new research challenges in areas like data
science, machine learning, logic programming, content engineering, social computing,
Semantic Web, and many more. This year was the 15th edition of the SEMANTiCS
conference series, which has developed into an internationally visible and professional
academic event.

Participants learn from top researchers and industry experts about emerging trends 
and topics in the wide area of semantic computing. The SEMANTiCS community is 
highly diverse; attendees have responsibilities in interlinking areas such as artificial 
intelligence, knowledge discovery and management, big data analytics, e-commerce, 
enterprise search, technical documentation, document management, business 
intelligence, and enterprise vocabulary management. 

This year, the subtitle of the SEMANTiCS conference was “The Power of AI and
Knowledge Graphs,” and we especially welcomed submissions on the following hot topics:


— Web Semantics and Linked (Open) Data
— Enterprise Knowledge Graphs, Graph Data Management, and Deep Semantics
— Machine Learning and Deep Learning Techniques
— Semantic Information Management and Knowledge Integration
— Terminology, Thesaurus, and Ontology Management
— Data Mining and Knowledge Discovery
— Reasoning, Rules, and Policies
— Natural Language Processing
— Data Quality Management and Assurance
— Explainable Artificial Intelligence
— Semantics in Data Science
— Semantics in Blockchain and Distributed Ledger Technologies
— Trust, Data Privacy, and Security with Semantic Technologies
— Economics of Data, Data Services, and Data Ecosystems


We additionally issued calls for two special tracks: 


— Digital Humanities and Cultural Heritage 
— LegalTech 


Following the great success of SEMANTiCS 2018 in Vienna, we received 88
submissions. To provide high-quality reviews for these submissions, we set up a
Program Committee (PC) comprising 111 members to help us select the papers with
the highest impact and scientific merit. For each submission, at least three reviews
were written independently by the assigned reviewers in a single-blind review
process (author names are visible to reviewers, but reviewers stay anonymous).
After all reviews were submitted, the PC chairs compared the reviews and discussed
discrepancies and differing opinions with the reviewers to facilitate a meta-review
and recommend acceptance or rejection of the paper. Overall, we accepted 20 full
papers and 8 short papers from the 88 submissions, which resulted in a full paper
acceptance rate of 23%.

The program of SEMANTiCS 2019 was structured as follows. In the main
conference, the contributors of full papers including posters and industry talks gave
their presentations in thematically grouped sessions. These presentations covered a
broad palette of current trends and developments in semantic technologies. To support
the knowledge transfer between the academic and industrial communities, scientific
papers and industry papers were grouped according to the following thematic sessions:


— Semantic Information Management
— Knowledge Discovery and Semantic Search
— Knowledge Graphs
— Knowledge Extraction
— Natural Language Processing
— Thesaurus and Ontology Management
— Linked Data and Data Integration
— Distributed Ledger Technologies
— Smart Connectivity and Interlinking
— Special Track: LegalTech
— Special Track: Digital Humanities and Cultural Heritage
— Special Track: Knowledge Organization and Application for Complex Industry Settings


The Posters and Demos Track provided an opportunity to present late-breaking 
research results, smaller contributions, and innovative work in progress. 29 original 
submissions and 2 re-submissions from the research track were accepted to this track, 
selected with a peer-reviewing process from a total of 47 poster and demo submissions. 
The reviewing committee, which included 88 members, provided at least three reviews 
per submission. The accepted works have been published within the CEUR Workshop 
Proceedings series. 

Besides the scientific track of the conference, a call for industry presentations was 
launched, which resulted in 47 submissions of which 37 were accepted for presentation 
in the industry track. Additionally, an exhibition took place where organizations 
presented their semantics-based products and services. 

Deliberately long breaks throughout the conference, held in a well-suited venue,
together with social events provided excellent opportunities for networking with people
interested in semantics-related topics from different disciplines and parts of the world.

We are grateful to our keynote and invited speakers for sharing their ideas about the
future development of knowledge management, new media, and semantic technologies
with our attendees:

Keynote Speakers:


— Michael J. Sullivan (Oracle): “Hybrid Knowledge Management Architectures” 

— Michel Dumontier (Maastricht University): “Accelerating Biomedical Discovery 
with an Internet of FAIR Data and Services” 

— Andy Boyd and Brendan Nielsen (Shell): “High-grading Business Decisions 
through Semantic Technology” 

— Valentina Presutti (Consiglio Nazionale delle Ricerche): “Looking for Common 
Sense in the Semantic Web” 

— Katja Hose (Aalborg University): “Querying the Web of Data” 


Invited Speakers: 


— Andreas Harth (Fraunhofer Institute): “From Representing Knowledge to 
Representing Behaviour” 
— Christian Dirschl (Wolters Kluwer): “LegalTech — To whom it may concern” 


Many thanks also go to all authors who submitted papers and of course to the PC 
who provided careful reviews in a quick turnaround time. 

Special thanks go to Christian Dirschl (Wolters Kluwer Germany) and Andreas
Blumauer (Semantic Web Company), who organized all industry-related activities. We
would also like to thank Thomas Thurner and Martin Kaltenböck from the Semantic
Web Company for providing the organizational infrastructure and taking care of all the
operational tasks. Additionally, we would like to thank our local organization team,
Stefan Summesberger, Vivien Vetter, and Julia Holze, as well as all those helpful
hands that are too many to name, for supporting this year’s conference and turning it
into a success.

We would also like to thank our sponsors (i.a.o.):


— Premium Sponsors: eccenca, PoolParty, FIZ Karlsruhe, and CAS 

— Gold Sponsors: Semiodesk, metaphacts, and i-views 

— Silver Sponsors: Siemens, Ontotext, Franz Inc., Allegrograph, Enterprise
Knowledge, Deloitte, and HP Motion Content

— Bronze and Research: CID, Fraunhofer IAIS, Bosch, inovex, Oracle, Prêt-à-LLOD,
STI Innsbruck, GNOSS, Klarso, Ontopic, and SICK


Special thanks also go to the partners of the conference who are: 

University of Basel, BID - Bibliothek & Information International, Cefriel,
Connected Data London, Consiglio Nazionale delle Ricerche, Cyberforum, DBpedia,
eccenca, FIZ Karlsruhe, GFWM, IBM, KIT - Karlsruhe Institute of Technology, TIB,
University of Paderborn, University of Fribourg, Springer LNCS, Wolters Kluwer, and
WU Vienna.

We hope that SEMANTiCS 2019 will provide you with new inspirations for your
research and with opportunities for partnerships with other research groups and with
academic and industrial participants.


September 2019 Maribel Acosta 
Philippe Cudré-Mauroux 

Maria Maleshkova 

Tassilo Pellegrini 

Harald Sack 

York Sure-Vetter 


Organization Chairs 


Conference Chairs 


Harald Sack FIZ Karlsruhe — Leibniz Institute for Information 
Infrastructure, Germany 

York Sure-Vetter Karlsruhe Institute of Technology, Germany

Tassilo Pellegrini St. Pölten University of Applied Sciences, Austria 


Research and Innovation Chairs 


Maribel Acosta Karlsruhe Institute of Technology, Germany 
Philippe Cudré-Mauroux Université de Fribourg, Switzerland 


Special Track Chairs 


Sabrina Kirrane Institute for Information Business of WU Wien, Austria 
Victor de Boer Vrije Universiteit Amsterdam, The Netherlands 


Industry and Use Case Chairs 


Christian Dirschl Wolters Kluwer Germany, Germany 
Andreas Blumauer Semantic Web Company, Austria 


Poster and Demo Track Chairs 


Mehwish Alam FIZ Karlsruhe — Leibniz Institute for Information 
Infrastructure, Germany 
Ricardo Usbeck Paderborn University, Germany 


Workshop and Satellite Events Chairs 


Anna Lisa Gentile IBM Almaden Research Center, USA 
Irene Celino Cefriel, Politecnico di Milano, Italy


Proceedings Chairs 


Maria Maleshkova University of Bonn, Germany 
Tassilo Pellegrini St. Pölten University of Applied Sciences, Austria 


Promotion Chairs


Thomas Thurner                Semantic Web Company, Austria
Julia Holze                   AKSW, InfAI, Leipzig University, Germany
Stefan Summesberger           plantsome communication, Austria


Local Chairs


Thomas Thurner                Semantic Web Company, Austria
Vivien Vetter                 FIZ Karlsruhe — Leibniz Institute for Information Infrastructure, Germany


Sponsoring Chair


Stefan Summesberger           plantsome communication, Austria


Permanent Advisory Board 


Sören Auer                    Fraunhofer Institute for Intelligent Analysis and Information Systems, Germany
Andreas Blumauer              Semantic Web Company, Austria
Tobias Bürger                 BMW Group, Germany
Christian Dirschl             Wolters Kluwer Germany, Germany
Victor de Boer                Vrije Universiteit Amsterdam, The Netherlands
Anna Fensel                   Semantic Technology Institute (STI) Innsbruck, Austria
Dieter Fensel                 Semantic Technology Institute (STI) Innsbruck, Austria
Mike Heininger                GfWM Austria, Austria
Sebastian Hellmann            Institute of Applied Informatics e.V. at the University of Leipzig, Germany
Ute John                      GfWM Germany, WissensWertSchöpfung, Germany
Martin Kaltenböck             Semantic Web Company, Austria
Elmar Kiesling                TU Wien, Austria
Tassilo Pellegrini            St. Pölten University of Applied Sciences, Austria
Axel Polleres                 Institute for Information Business of WU Wien, Austria
Felix Sasaki                  DFKI, W3C Fellow, Germany
Harald Sack                   FIZ Karlsruhe — Leibniz Institute for Information Infrastructure and Karlsruhe Institute of Technology (KIT), Germany


Program Committee - Research and Innovation Track and Special Tracks


Harith Alani                  The Open University
Vito Walter Anelli            Politecnico di Bari
Luigi Asprino                 University of Bologna, STLab (ISTC-CNR)
Sören Auer                    TIB, University of Hannover
Nathalie Aussenac-Gilles      IRIT, CNRS
Sebastian Bader               Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS
Stefan Bischof                Siemens AG Österreich
Carlos Bobed                  everis, NTT Data
Loris Bozzato                 Fondazione Bruno Kessler
Carlos Buil-Aranda            Universidad Técnica Federico Santa María
Paul Buitelaar                Insight Centre for Data Analytics, National University of Ireland Galway
Irene Celino                  Cefriel
Davide Ceolin                 Vrije Universiteit Amsterdam
Pierre-Antoine Champin        Liris, Université Claude Bernard Lyon1
Vinay Chaudhri                SRI International, USA
Ioannis Chrysakis             FORTH-ICS, Greece
Ioana-Georgiana Ciuciu        Babes-Bolyai University
Oscar Corcho                  Universidad Politécnica de Madrid
Gianluca Correndo             University of Southampton
Enrico Daga                   The Open University
Ben De Meester                Ghent University
Elena Demidova                L3S Research Center
Sylvie Despres                Laboratoire d’Informatique Médicale et de BIOinformatique (LIM&BIO)
Chiara Di Francescomarino     Fondazione Bruno Kessler-Irst
Stefan Dietze                 GESIS - Leibniz Institute for the Social Sciences
Anastasia Dimou               Ghent University
Jens Dörpinghaus              Fraunhofer
Mauro Dragoni                 Fondazione Bruno Kessler-Irst
Anca Dumitrache               Vrije Universiteit Amsterdam
Jérôme Euzenat                Inria, University of Grenoble Alpes
Victoria Eyharabide           STIH Laboratory, Sorbonne University
Michael Färber                University of Freiburg
Catherine Faron Zucker        Université Nice Sophia Antipolis
Said Fathalla                 University of Bonn
Ingo Feinerer                 University of Applied Sciences Wiener Neustadt
Javier D. Fernandez           Vienna University of Economics and Business
Agata Filipowska              Poznan University of Economics
Nuno Freire                   INESC-ID
Roberto Garcia                Universitat de Lleida
Raul Garcia-Castro            Universidad Politécnica de Madrid
Daniel Garijo                 Information Sciences Institute
Annalisa Gentile              IBM
Jose Manuel Gomez-Perez       ExpertSystem
Michael Granitzer             University of Passau
Alasdair Gray                 Heriot-Watt University
Paul Groth                    University of Amsterdam
Peter Haase                   metaphacts


Benjamin Heitmann             RWTH Aachen University
Lars Heling                   Karlsruhe Institute of Technology
Eelco Herder                  Radboud University
Pieter Heyvaert               IDLab Ghent University — imec, Belgium
Rinke Hoekstra                University of Amsterdam
Geert-Jan Houben              Delft University of Technology
Zhisheng Huang                Vrije Universiteit Amsterdam
Shimaa Ibrahim                Bonn University
Marc Jacobs                   Fraunhofer
Tobias Käfer                  Karlsruhe Institute of Technology
Lucie-Aimée Kaffee            University of Southampton
Elias Kärle                   STI-Innsbruck
Tomi Kauppinen                Aalto University School of Science
Dimitris Kontokostas          University of Leipzig
Efstratios Kontopoulos        Information Technologies Institute, Centre for Research & Technology — Hellas, Greece
Tobias Kuhn                   Vrije Universiteit Amsterdam
Christoph Lange               Fraunhofer FIT, Germany
Maxime Lefrançois             MINES Saint-Etienne
Isaac Lera                    UIB
Steffen Lohmann               Fraunhofer
Vanessa Lopez                 IBM
Vincent Lully                 Sorbonne Université, France
Nicole Merkle                 FZI Forschungszentrum Informatik am KIT
Lyndon Nixon                  MODUL Technology GmbH
Leo Obrst                     MITRE
Jan Oevermann                 University of Bremen, German Research Center for Artificial Intelligence (DFKI)
Harshvardhan Jitendra Pandit  ADAPT, Trinity College Dublin
Heiko Paulheim                University of Mannheim
Catia Pesquita                LaSIGE, Universidade de Lisboa
Jasmin Pielorz                Austrian Institute of Technology
Jędrzej Potoniec              Poznan University of Technology
Cédric Pruski                 Luxembourg Institute of Science and Technology
Filip Radulovic               Sépage in Paris, France
Alessandro Raganato           University of Helsinki
Artem Revenko                 Semantic Web Company GmbH
Giuseppe Rizzo                LINKS Foundation
Oscar Rodríguez Rocha         Inria
Anisa Rula                    University of Milano-Bicocca
Marta Sabou                   Vienna University of Technology
Vadim Savenkov                Vienna University of Economics and Business (WU)
Stefan Schlobach              Vrije Universiteit Amsterdam
Pavel Shvaiko                 Informatica Trentina
Ruben Taelman                 Ghent University — imec


Sanju Tiwari Ontology Engineering Group 
Konstantin Todorov LIRMM, University of Montpellier 
Riccardo Tommasini Politecnico di Milano 

Jürgen Umbrich Vienna University of Economics and Business (WU)
Victoria Uren Aston University 

Mathias Uslar OFFIS 

Herbert Van De Sompel Data Archiving Networked Services 
Frank Van Harmelen Vrije Universiteit Amsterdam 
Maria Esther Vidal Universidad Simon Bolivar 

Joerg Waitelonis yovisto GmbH 

Shenghui Wang OCLC Research 

Ziqi Zhang Sheffield University 


Additional Reviewers 


Wazed Ali TIB 

Imran Asif Heriot-Watt University
Javad Chamanara L3S 

Andrea Cimmino Arriaga Universidad de Sevilla 
Diego Collarana IAIS Fraunhofer 
Mirette Elias University of Bonn 
Simon Gottschalk L3S 

Prashant Khare The Open University 
Allard Oelen TIB 


Nicolas Tempelmeier L3S 


Contents 


Web Semantics and Linked (Open) Data 


Usage of Semantic Web in Austrian Regional Tourism Organizations
Christina Lohvynenko and Dietmar Nedbal

Test-Driven Approach Towards GDPR Compliance
Harshvardhan J. Pandit, Declan O’Sullivan, and Dave Lewis

Linked Data Supported Content Analysis for Sociology
Tabea Tietz and Harald Sack

LinkedSaeima: A Linked Open Dataset of Latvia’s Parliamentary Debates
Uldis Bojārs, Roberts Dargis, Uldis Lavrinovics, and Pēteris Paikens

MusicKG: Representations of Sound and Music in the Middle Ages as Linked Open Data
Victoria Eyharabide, Vincent Lully, and Florentin Morel


Machine Learning and Deep Learning Techniques 


Improving NLU Training over Linked Data with Placeholder Concepts
Tobias Schmitt, Cedric Kulbach, and York Sure-Vetter

Using Weak Supervision to Identify Long-Tail Entities for Knowledge Base Completion
Yaser Oulabi and Christian Bizer


Semantic Information Management and Knowledge Integration 


Evaluating Generalized Path Queries by Integrating Algebraic Path Problem Solving with Graph Pattern Matching
Abhisha Bhattacharyya, Ilya Baldin, Yufeng Xin, and Kemafor Anyanwu

Building a Conference Recommender System Based on SciGraph and WikiCFP
Andreea Iana, Steffen Jung, Philipp Naeser, Aliaksandr Birukou,
Sven Hertling, and Heiko Paulheim

V4Ann: Representation and Interlinking of Atom-Based Annotations of Digital Content
Georgios Meditskos, Stefanos Vrochidis, and Ioannis Kompatsiaris

RSP-QL*: Enabling Statement-Level Annotations in RDF Streams
Robin Keskisärkkä, Eva Blomqvist, Leili Lind, and Olaf Hartig


Terminology, Thesaurus and Ontology Management 


The Semantic Asset Administration Shell
Sebastian R. Bader and Maria Maleshkova

Taxonomy Extraction for Customer Service Knowledge Base Construction
Bianca Pereira, Cecile Robin, Tobias Daudert, John P. McCrae,
Pranab Mohanty, and Paul Buitelaar

An Ontology Alignment Approach Combining Word Embedding and the Radius Measure
Molka Tounsi Dhouib, Catherine Faron Zucker,
and Andrea G. B. Tettamanzi

Ontology Design Rules Based on Comparability via Particular Relations
Philippe A. Martin, Olivier Corby, and Catherine Faron Zucker

From Monolingual to Multilingual Ontologies: The Role of Cross-Lingual Ontology Enrichment
Shimaa Ibrahim, Said Fathalla, Hamed Shariat Yazdi, Jens Lehmann,
and Hajira Jabeen

MELT - Matching Evaluation Toolkit
Sven Hertling, Jan Portisch, and Heiko Paulheim


Data Mining and Knowledge Discovery 


Interaction Network Analysis Using Semantic Similarity Based on Translation Embeddings
Awais Manzoor Bajwa, Diego Collarana, and Maria-Esther Vidal

CACAO: Conditional Spread Activation for Keyword Factual Query Interpretation
Edgard Marx, Gustavo Correa Publio, and Thomas Riechert

Fine-Grained Named Entity Recognition in Legal Documents
Elena Leitner, Georg Rehm, and Julian Moreno-Schneider

Extracting Literal Assertions for DBpedia from Wikipedia Abstracts
Florian Schrage, Nicolas Heist, and Heiko Paulheim


Towards a Scalable Semantic-Based Distributed Approach for SPARQL Query Evaluation
Gezim Sejdiu, Damien Graux, Imran Khan, Ioanna Lytra,
Hajira Jabeen, and Jens Lehmann

Automatic Facet Generation and Selection over Knowledge Graphs
Leila Feddoul, Sirko Schindler, and Frank Löffler

Knowledge Graph Exploration: A Usability Evaluation of Query Builders for Laypeople
Emil Kuric, Javier D. Fernandez, and Olha Drozd

QUANT - Question Answering Benchmark Curator
Ria Hari Gusmita, Rricha Jalota, Daniel Vollmers, Jan Reineke,
Axel-Cyrille Ngonga Ngomo, and Ricardo Usbeck

Simple-ML: Towards a Framework for Semantic Data Analytics Workflows
Simon Gottschalk, Nicolas Tempelmeier, Günter Kniesel,
Vasileios Iosifidis, Besnik Fetahu, and Elena Demidova


Semantics in Blockchain and Distributed Ledger Technologies 


Incorporating Blockchain into RDF Store at the Lightweight Edge Devices
Anh Le-Tuan, Darshan Hingu, Manfred Hauswirth, and Danh Le-Phuoc

Verifying the Integrity of Hyperlinked Information Using Linked Data and Smart Contracts
Christoph Braun and Tobias Käfer

Author Index


Web Semantics and Linked (Open) Data 




Usage of Semantic Web in Austrian Regional 
Tourism Organizations 


Christina Lohvynenko and Dietmar Nedbal


University of Applied Sciences Upper Austria,
Wehrgrabengasse 1-3, 4400 Steyr, Austria
christina-lohvynenko@gmx.at,
dietmar.nedbal@fh-steyr.at


Abstract. Tourism is one of the most important economic sectors in Austria.
Given the high proportion of international visitors to Austria, the websites of
regional tourism organizations (RTOs) are an essential source of information.
A state-of-the-art tourism website should include semantic markup for tourism
topics so that search engines and other intelligent software applications can
access and understand the presented data. This paper empirically studies the
usage of Semantic Web formats, ontologies, and topics relevant for tourism on
the websites of all 137 Austrian RTOs. Results show that 59% of the RTOs use
semantic markup. Most regions adhere to the recommendations of leading
search engines, utilizing ontologies such as Schema.org and the formats
Microdata and JSON-LD. While most semantic markup incorporates basic
information (e.g. navigation, addresses, corporate data), only a few Austrian
RTOs annotate tourism-relevant topics that would help unlock the full
potential of the Semantic Web, such as regional events, accommodations, blog
posts, images, or social media.


Keywords: Semantic Web · Regional tourism organizations · Survey · Austria


1 Introduction 


With nearly 45 million resident and non-resident guests in 2018, tourism is one of the
most important Austrian economic sectors [1]. In recent years, the tourism and leisure
industry has contributed around 16% to the Austrian gross domestic product through
direct and indirect effects [2]. Even in international comparison, the country occupies
an important place among the top 20 tourism destinations [3]. The tourism regions,
which sit in the middle of the hierarchical organization of this industry in Austria,
contribute significantly to promoting individual tourism destinations and to addressing
a broad target group [4]. These regional tourism organizations (RTOs) also play an
important role in potentially reducing the dependence on international online travel
agencies (OTAs), which dominate the tourism market. Given the growth of Internet
usage and the high proportion of international visitors to Austria, the websites of
tourism providers are becoming increasingly important. A state-of-the-art website that
implements innovative web technologies is therefore essential [5, 6].

The use of Semantic Web and Linked Data technologies has long been standard
practice in website optimization. It aims to make important content-bearing elements
of web pages machine-readable by means of semantic markup, so that search engines
and other intelligent software applications can access the data more easily. Annotating
a website with structured data is one of the most common search engine optimization
practices and is also recommended by leading search engines. It can thus increase the
online visibility of the web page and online sales figures [7-9]. However, empirical
analysis of the use of the Semantic Web by hotel websites in Austria has shown that
usage by direct providers, in contrast to OTAs, is very moderate and often flawed
[6, 10]. Such weak use of structured data in the hotel industry suggests that the
Semantic Web has not yet become a standard in Austria’s tourism industry.
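
As a rough illustration of what such semantic markup can look like, the following
Python sketch builds a Schema.org Event annotation and serializes it as a JSON-LD
script block of the kind embedded in a web page. The event name, dates, and place
are invented placeholders and are not taken from any RTO website; the helper name
to_jsonld_script is likewise hypothetical.

```python
import json


def to_jsonld_script(annotation: dict) -> str:
    """Wrap a Schema.org annotation in a JSON-LD <script> element."""
    payload = json.dumps(annotation, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder data for illustration only (not from any real RTO website).
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Example Regional Summer Festival",
    "startDate": "2019-07-20",
    "endDate": "2019-07-21",
    "location": {
        "@type": "Place",
        "name": "Example Town Square",
        "address": {
            "@type": "PostalAddress",
            "addressLocality": "Example Town",
            "addressCountry": "AT",
        },
    },
}

snippet = to_jsonld_script(event)
print(snippet)
```

The same information could also be expressed inline with Microdata attributes;
JSON-LD is used here only because it keeps the annotation separate from the
visible page content.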

With the RTOs playing an important role in Austrian tourism, the current paper
aims to elucidate the usage status of the Semantic Web among these websites. It first
discusses the background and related work on the use of structured data in tourism in
Sect. 2. Further, the results of an empirical investigation are reported. For this purpose,
the selection of the examination objects and preparation of the data for analysis are
described in Sect. 3. The results of the evaluation are presented in Sect. 4, followed by
a discussion (Sect. 5). Finally, Sect. 6 provides concluding remarks.


2 Background and Related Work 


One of the most important communication channels of a tourism organization is the
website, which should adhere to the current state of the art. In this context it has been
recognized that innovative software providing interoperability through ontologies is 
critical for further innovation in the tourism industry [11]. Although there has been 
progress in the last ten years, a recent study highlights the still current and growing 
importance of semantics and ontologies in tourism. The authors further state that 
academic research in these disciplines is still in its infancy [12]. 

Website owners and content managers of tourism regions face several challenges
when attempting to semantically enrich data on their website. First of all, selecting
the appropriate vocabulary, format, and content is not a trivial task. In addition to
common, domain-independent vocabularies, several domain-specific ontologies
for tourism have been developed, which makes it difficult to select the most suitable
and, at the same time, future-proof vocabulary. The Linked Open Vocabularies
project, for example, provides a central information point about well-documented
vocabularies [13]. The constantly growing website listed 660 high-quality vocabularies
as of Feb. 2019. Measured by the number of vocabularies that reuse them, the
most popular ontologies are Dublin Core Metadata Terms (dcterms), Dublin Core
Metadata Element Set (dce), Friend of a Friend vocabulary (foaf), A vocabulary for
annotating vocabulary descriptions (vann), Simple Knowledge Organization System
(skos), Creative Commons Rights Expression Language (cc), SemWeb Vocab Status
ontology (vs) and Schema.org vocabulary (schema) [14]. The problem with common
vocabularies often lies in their lower level of precision compared with domain-specific
ontologies. For example, until version 3.0, Schema.org lacked the ability to describe
the number of beds in a room or whether pets are allowed [15]. One of the main goals of
tourism-specific vocabularies is to achieve better interoperability and integration of
travel information systems [16]. Several research efforts have focused on the design of
semantic vocabularies for the tourism and travel industry [17] (e.g. Harmonise [18],
QALL-ME [19], cDott [20], Accommodation Ontology [21], Tourpedia [22]).

Given the number and diversity of available ontologies, industry-wide adoption
is crucial for a future-proof vocabulary. The Web Data Commons project features the
largest publicly available collection of structured data from a non-profit organization
[23], allowing researchers to analyze the adoption of structured data across the Web.
An analysis for the period 2010 to 2013 showed that the use of the Semantic Web, its
formats, and data classes has been steadily increasing. A comparison of the 2012 and
2013 datasets revealed that the number of websites using Microdata grew by
more than a factor of four in just one year. The topics that received the most attention
through semantic markup were people and organizations, blog articles, navigation
information, and product and event data [23]. Another study, focusing only on the
adoption of Schema.org, showed that about half of the elements of this vocabulary
had not been used on any of the websites in the Web Data Commons dataset [24].

Since a website is one of the most important means of communication for tourism
organizations, several studies have addressed the quality of tourism websites.
International online travel agencies have heavily dominated the tourism sector in recent
years. Tourism organizations in Austria also suffer from this online competition
and try to counteract it by means of innovative technologies and
intelligent advertising of products and services on several channels. When the quality
of content and services offered on official websites of tourism organizations is compared
with that of online travel agencies’ websites, OTA websites have often received better
results. According to these studies, tourism websites often do not follow state-of-the-art
online developments, so OTAs lead in terms of technology usage [6, 10, 25].
As far as Austria is concerned, studies in recent years have found good
performance and numerous innovative integrated services on the websites of official
Austrian tourism organizations in international comparison [26, 27].

The use of well-documented structured markup should enable error-free annotation
and improve the quality of the website. Unfortunately, many kinds of erroneous and
restricted usage of semantic markup occur in practice when using vocabularies
like Schema.org, which hinders real-life applications from using the data [10, 28]. To
counteract this problem, Simsek et al. described an approach that validates Schema.org
markup in terms of semantic consistency and completeness of the annotations for a
specified domain [29]; it was implemented in the online tool semantify.it [30].
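
To make the notion of completeness checking more concrete, the following Python
sketch tests an annotation against a small set of required properties per type.
It is only a simplified illustration under stated assumptions: the REQUIRED table
is a hypothetical domain specification, the function name missing_properties is
invented here, the annotation is assumed to carry a single string in @type, and
nothing below reproduces the approach of [29] or the interface of semantify.it.

```python
# Hypothetical domain specification: required Schema.org properties per type.
REQUIRED = {
    "Event": {"name", "startDate", "location"},
    "LodgingBusiness": {"name", "address", "telephone"},
}


def missing_properties(annotation: dict) -> set:
    """Return the required properties that are absent or empty in the annotation."""
    required = REQUIRED.get(annotation.get("@type"), set())
    return {prop for prop in required if not annotation.get(prop)}


# An incomplete annotation: startDate is empty and location is not given.
incomplete = {"@type": "Event", "name": "Example Festival", "startDate": ""}
print(sorted(missing_properties(incomplete)))
```

Types that are not listed in the specification are treated as having no required
properties, so the check reports nothing for them.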

Benefits of using Semantic Web technology include better visibility in the
search results of leading search engines [7], as well as better online visibility of the
promotions being advertised [5]. This further helps reduce reliance on OTAs, enables
the use of structured data by emerging intelligent applications (e.g. chatbots and voice
search) and improves interoperability among market participants [31-33].

The literature review has shown that the use of the Semantic Web has a long
history and great potential for the industry. Studies indicate that the tourism sector often
lacks expertise in and knowledge of the correct use of Semantic Web technology.
Furthermore, research on the use of semantic technologies in Austrian tourism
organizations focuses mostly on either the hotel sector or individual tourism
organizations. A recent study of Semantic Web usage covering all Austrian regional
tourism organizations could not be identified during the literature search.


3 Methodology 


The methodology for the empirical investigation started with the definition and selection
of the examination objects. This is followed by a description of the data extraction
process and the preparation of the semantic markup for the actual analysis. It is also
detailed how incomplete and erroneous annotations were identified and how they were
assigned to groups that emerged during this analysis.
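
The tooling used for extraction is not named at this point, so the following Python
sketch shows only one possible way to pull JSON-LD annotations out of a page and
to count markup that cannot be parsed. The class name JsonLdExtractor and the
sample HTML are assumptions made for illustration; Microdata and RDFa are not
handled.

```python
import json
from collections import Counter
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.annotations = []
        self.errors = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.annotations.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                self.errors += 1  # count markup that cannot be parsed


# Invented sample page: one valid annotation and one syntactically broken one.
html = """
<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Organization", "name": "Example RTO"}</script>
<script type="application/ld+json">{"@type": "Event", </script>
</head></html>
"""

extractor = JsonLdExtractor()
extractor.feed(html)
type_counts = Counter(
    a.get("@type") for a in extractor.annotations if isinstance(a, dict)
)
print(type_counts, extractor.errors)
```

Counting the annotated types per website in this way gives the kind of raw data
from which usage of formats and topics can later be tabulated.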


3.1 Selection of the Examination Objects 


Austrian regional tourism organizations are well suited as examination objects for this 
analysis, as they usually have an established website with comparable contents of the 
region. However, the number of these organizations is not constant in Austria, which 
makes objective analysis more difficult. 

The organization of Austrian tourism has a hierarchical structure. The basis of the
tourism market is provided by the 65,000 tourism businesses, most of which operate in
municipalities that are classified as tourism-intensive municipalities, with at least 1,000
overnight stays per year. Of the 1,568 Austrian tourism-intensive municipalities, 151
were categorized as tourism regions in 2008 [34]. At the state level, tourism in Austria
is divided into the respective offices of the nine state governments, with one national
tourism organization (“Austrian National Tourist Office”) at the top, working closely
together with the tourism regions. Therefore, in this work, the tourism regions together
with the nine state tourism organizations and the national tourism organization are
referred to as regional tourism organizations (RTO) in the following.

As mentioned, the number of RTOs varies over time. For example, in Upper
Austria, a new tourism law came into force, according to which the number of tourism
associations (and thus also the RTOs) must be reduced from 100 to 20 by the year
2020. As of June 15, 2018, some tourism associations had already merged but still had
separate websites (e.g. “Wels” and “Sattledt”), while others had no joint website (e.g.
“Oberes Mühlviertel”). For this research, the list of RTOs to be
examined was determined in a top-down approach. Starting from the actual
references on the websites of the nine state tourism organizations, an initial list of 117
regional websites was gathered (3 organizations in Burgenland, 6 in Lower Austria, 26 in
Upper Austria, 14 in Carinthia, 17 in Salzburg, 9 in Styria, 35 in Tyrol, 6 in Vorarlberg).
After examining the individual websites of these 117 organizations, the following changes
were made: two Upper Austrian RTOs without their own website (“Nationalpark Region
Ennstal” and “Steyrtal”) were removed, and RTOs with separate individual websites
were added in Carinthia (1 RTO split into 3 websites), Styria (1 RTO split into 2
websites), and Tyrol (2 RTOs split into 7 websites). In total, 133 websites (one
national, nine state, and 123 regional tourism organization websites) were included, all of
which are subsequently referred to as RTOs.
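The step from 117 to 133 websites can be retraced with a few lines of arithmetic. The following Python sketch only restates the counts given above; the variable names are illustrative and not part of the study.

```python
# Counts taken from the text above; names are illustrative only.
initial_regional = 117       # regional websites referenced by the state organizations
removed_without_website = 2  # Upper Austrian RTOs without their own website

# state -> (number of RTOs, number of separate websites they operate)
splits = {"Carinthia": (1, 3), "Styria": (1, 2), "Tyrol": (2, 7)}
added_by_splits = sum(sites - rtos for rtos, sites in splits.values())

regional = initial_regional - removed_without_website + added_by_splits
total = regional + 9 + 1     # nine state organizations and one national organization

print(regional, total)       # 123 133
```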




3.2 Data Extraction Process 


For this research we used data from Web Data Commons [23], which makes raw web page
data, extracted metadata, and snippets of individual web pages available to the public.
The data collection entitled “WDC RDFa, Microdata, Embedded JSON-LD, and
Microformats Data Sets (November 2017)” was used as the basis for data extraction. The
original record contains 8,433 files, each around 100 MB in size. The data in the
collection is represented as RDF quads consisting of subject, predicate, and object as
well as the URL of the web page from which the data was extracted as the fourth element.

With the help of a shell script, the downloaded files were unpacked and examined
for the presence of semantic annotations of one of the 133 defined RTOs. The script
generates plain text files and can be downloaded from https://tlp.de/shellscript.
The script ran for approximately 48 h, with ten tasks running simultaneously
on several machines.
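The filtering step can be illustrated with a short Python sketch. It is not the authors' shell script: it assumes gzip-compressed N-Quads input, uses a hypothetical three-entry domain list in place of the 133 RTO domains, and matches on the host name of the fourth quad element only.

```python
import gzip
import re
from urllib.parse import urlparse

# Hypothetical subset of the 133 RTO domains; the real list is not reproduced here.
RTO_DOMAINS = {"wien.info", "zillertalarena.com", "montafon.at"}

# In an N-Quads line the graph label is the last <IRI> before the closing " ."
GRAPH_IRI = re.compile(r"<([^<>\s]+)>\s*\.\s*$")


def rto_domain(quad_line, domains=RTO_DOMAINS):
    """Return the RTO domain the quad was extracted from, or None."""
    match = GRAPH_IRI.search(quad_line)
    if match is None:
        return None
    host = (urlparse(match.group(1)).hostname or "").lower()
    for domain in domains:
        # Accept the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def filter_file(path, domains=RTO_DOMAINS):
    """Yield (domain, quad) pairs from one gzip-compressed N-Quads file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            domain = rto_domain(line, domains)
            if domain is not None:
                yield domain, line.rstrip("\n")
```

RTOs that are identified by a URL path rather than a host name would need an additional path check, which this sketch omits.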


3.3 Preparation of Semantic Markup 


The data was prepared for the actual evaluation using Microsoft Excel
2016. The first step was to create 133 Excel spreadsheets, one for each tourism region,
from the text files generated by the shell script using an Excel macro. With the help of
conditional formatting, regular expressions, and filtering rules in Excel, duplicated
annotations and mentions (repeated use of the same annotation on the
same web page) were removed, and the markup of all subdomains of the respective RTO
was checked and adjusted if necessary. Thus, only those data remained in which the
fourth part of the RDF quad contained the domain of one of the 133 defined RTOs.

After all tables had been cleaned of irrelevant data, the individual tables containing
structured data were combined into two files (a separate Excel file containing the
“wien.info” markup and one for the remaining 77 websites). This subdivision was
necessary due to the limited number of rows in this version of Excel.

In order to identify the different types of structured data on the websites of
Austrian tourism regions, the table was extended with additional information. The
final analysis table can be downloaded from http://tlp.de/analysistable as a
Microsoft Excel file. It contains the following columns:


• The first column contains the relevant RDF quads (430,894 in Vienna and 769,824
  in the file for all remaining regions).

• The second column (“Region”) contains the domain of the respective tourism
  region, gathered from the URL.

• The third column (“Federal State”) allows the assignment to one of the nine federal
  states and to the national tourism organization of Austria (austriatourism.com).

• The fourth column (“Format”) contains the format used for a specific semantic
  markup. This information was taken from the Web Data Commons file name from
  which the respective RDF quad was extracted (e.g. the file
  “dpef.html-embedded-jsonld.nq” contains the semantic annotations made using JSON-LD).

• The “Namespace & data type” column represents the predicate of the respective
  triples and contains, in addition to the namespace of the ontology, the names of the
  data classes and data properties used. The namespaces were determined by means of
  the Excel filter function from the first column containing the RDF quads.

• The “Ontology” column captures the name of the ontology, which was determined
  by the namespace in the “Namespace & data type” column.

• The “Class” column contains the data classes used and the “Property” column lists
  the data properties used by the RDF quad. The data on classes and properties was
  determined using the Excel filter function from the RDF quads themselves or from
  the “Namespace & data type” column.

• The “Topics” column aggregates the data classes from the
  various ontologies used into subject areas containing similar or related objects (cf.
  Sect. 4.4).

• The last column (“Remark”) was used to take notes about errors or incomplete
  semantic annotations found, most of which were previously described in the study of
  Meusel and Paulheim [28]. Mistakes found include a missing slash, incorrect upper or
  lower case, missing or incorrect use of data types, incorrect use of a namespace,
  a property mapped to an incorrect class or data type, incorrect use of property values,
  and incomplete or wrong specification of a namespace.
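Several of these columns can be derived mechanically from a quad and the name of the file it came from. The Python sketch below illustrates the idea; the study itself used Excel filters. The namespace prefixes and file-name markers listed here are assumptions for illustration and cover only part of the ontologies and formats found.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed prefix-to-ontology mapping; the namespaces in the real data may differ.
ONTOLOGIES = {
    "http://schema.org/": "Schema.org",
    "http://data-vocabulary.org/": "Data Vocabulary",
    "http://purl.org/dc/terms/": "Dublin Core",
    "http://ogp.me/ns#": "OGP",
}

# Assumed file-name markers for the four formats.
FORMATS = {
    "html-embedded-jsonld": "JSON-LD",
    "html-microdata": "Microdata",
    "html-rdfa": "RDFa",
    "html-mf-": "Microformats",
}


def describe(quad_line, file_name):
    """Derive illustrative analysis columns for one N-Quads line."""
    _subject, predicate, rest = quad_line.split(None, 2)
    predicate = predicate.strip("<>")

    # For rdf:type statements the object names the data class.
    data_class = ""
    if predicate == RDF_TYPE and rest.startswith("<"):
        data_class = rest[1:rest.index(">")]
    term = data_class or predicate

    ontology = next(
        (name for prefix, name in ONTOLOGIES.items() if term.startswith(prefix)),
        "unknown",
    )
    data_format = next(
        (name for marker, name in FORMATS.items() if marker in file_name),
        "unknown",
    )
    return {
        "Format": data_format,
        "Ontology": ontology,
        "Class": data_class,
        "Property": "" if data_class else predicate,
    }
```

A quad with the predicate http://schema.org/name taken from “dpef.html-embedded-jsonld.nq” would thus be labelled as JSON-LD and Schema.org, with the predicate in the “Property” field.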


4 Analysis Results 


This section contains the main findings of the survey on the use of Semantic Web 
technology by Austrian RTOs. First, an overview of the top 20 RTOs using semantic 
markup is given. This is followed by a brief analysis of the formats and ontologies 
used. Finally, insight into the topics that were annotated by the RTOs is provided. 


4.1 Number of RTOs Using Semantic Annotations


A total of 78 Austrian RTOs (59%) use Semantic Web annotations on their websites,
while the remaining 55 RTO websites did not show any semantic markup in the course
of this analysis.

Figure 1 shows the top 20 RTOs, measured by the absolute number of RDF quads
identified. The leading RTO is Vienna (domain: wien.info), which has 430,894 RDF
quads integrated into its website. Second place in this ranking is occupied by
zillertalarena.com with 129,320 RDF quads. The other 18 RTOs shown in the figure each
use more than 10,000 RDF quads. The structured data from wien.info alone makes up
36% of the entire data set; zillertalarena.com adds another 11%, and the remaining 18
RTOs from the top 20 list sum up to 42% of all annotations. The top 20 regions thus
account for 89% of the total amount of semantic markup.
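These shares can be recomputed from the quad counts reported in Sect. 3.3 and Fig. 1. The short Python sketch below does so; the variable and function names are illustrative.

```python
# Total number of RDF quads: wien.info plus all remaining regions (Sect. 3.3).
total_quads = 430_894 + 769_824

# Quad counts of the top 20 RTOs as listed in Fig. 1.
top20 = [
    430_894, 129_320, 59_929, 51_787, 48_978, 42_308, 41_380, 33_825, 27_838, 27_527,
    23_944, 18_325, 18_262, 17_425, 17_391, 16_215, 15_316, 15_273, 14_352, 13_318,
]


def share(count):
    """Share of the whole data set in percent, rounded to an integer."""
    return round(100 * count / total_quads)


print(share(top20[0]))        # wien.info: 36
print(share(top20[1]))        # zillertalarena.com: 11
print(share(sum(top20[2:])))  # remaining 18 RTOs of the top 20: 42
print(share(sum(top20)))      # all top 20 RTOs: 89
```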


4.2 Formats


The use of the Semantic Web formats shows a clear preference for the Microdata format
(93.9%), measured by the absolute number of uses in the RDF quads. JSON-LD was used
in 3% and Microformats in 2.8% of the RDF quads. RDFa accounts for only 0.3% and
consists almost exclusively of the Open Graph protocol (OGP).

53.8% of the 78 RTOs with structured data use Microdata as the only format for 
semantic annotation of website content. The use of multiple formats by RTOs is
heterogeneous: 10.3% use Microdata and Microformats at the same time, another 9% 
Microdata and JSON-LD. The three formats Microdata, Microformats and JSON-LD 
are simultaneously used by 7.7% of the RTOs. RDFa alone is used by four RTOs 
(5.1%). All four formats are used by three RTOs (3.8%). The remaining 10.3% of the
RTOs use one of five other combinations of formats.


 #  | RTO website                        | RDF quads
 1  | wien.info                          | 430,894
 2  | zillertalarena.com                 | 129,320
 3  | kitzbueheler-alpen.com             |  59,929
 4  | montafon.at                        |  51,787
 5  | wilderkaiser.info                  |  48,978
 6  | innsbruck.info                     |  42,308
 7  | gastein.com                        |  41,380
 8  | lech-zuers.at                      |  33,825
 9  | grossarltal.info                   |  27,838
 10 | weinviertel.at                     |  27,527
 11 | neusiedlersee.com                  |  23,944
 12 | best-of-zillertal.at               |  18,325
 13 | millstaettersee.com                |  18,262
 14 | mayrhofen.at                       |  17,425
 15 | kufstein.com                       |  17,391
 16 | kaiserwinkl.com                    |  16,215
 17 | kitzbueheler-alpen.com/st-Johann   |  15,316
 18 | kitzbuehel.com                     |  15,273
 19 | wienerwald.info                    |  14,352
 20 | kaernten.at                        |  13,318


Fig. 1. Top 20 Austrian RTOs by absolute number of RDF quads.


4.3 Structured Data Markup: Ontologies 


The examined websites use a total of eight different ontologies. The most used
ontology is Schema.org, with 63.7% of the absolute uses in the RDF quads.
Data Vocabulary is in second place (18.2% of the RDF quads). Dublin Core terms are
used by a large number of RTOs (61 websites) but account for only 3.3% of the overall
RDF quads. The remaining five ontologies (hCard, OGP, iCal Schema, XFN, FOAF)
are each referenced by less than 3% of the semantic markup. Interestingly, none of the
vocabularies developed specifically for tourism were found in the examined objects.


4.4 Topics


Since the same or similar content can be annotated using various ontologies and data
classes, an overview of the topics covered by the RTOs needs additional
consolidation. For this reason, the thematically related objects of a tourism website
were subsequently grouped into topics, representing subject areas or categories.


Table 1. Topics and their associated ontologies and data classes. 


#  | Topic                    | Ontologies and data classes
1  | Addresses                | s:GeoCoordinates, s:PostalAddress, vcard:Address, vcard:adr,
   |                          | vcard:addressType, vcard:country-name, vcard:email, vcard:locality,
   |                          | vcard:postal-code, vcard:region, vcard:street-address, vcard:tel
2  | Blogs                    | s:Article, s:Blog, s:CreativeWork, s:BlogPosting, vcard:family-name,
   |                          | vcard:fn, vcard:given-name, vcard:n, vcard:Name, vcard:nickname,
   |                          | vcard:note, vcard:title, vcard:url, vcard:vcard
3  | Navigational Information | dv:Breadcrumb, s:BreadcrumbList, s:ItemList, s:ListItem, s:url,
   |                          | s:SiteNavigationElement, s:WPFooter, s:WPHeader
4  | Organization             | dv:Organization, s:Organization, vcard:org, vcard:Organization,
   |                          | vcard:organization-name, vcard:uid
5  | People                   | foaf:Person, s:JobPosting, s:Person
6  | Product Data             | s:AggregateOffer, s:AggregateRating, s:Hotel, s:BedAndBreakfast,
   |                          | s:LocationFeatureSpecification, s:LodgingBusiness, s:Offer,
   |                          | s:Product, s:Date, s:PropertyValue, s:Rating, s:Reservation, s:Review,
   |                          | vcard:fn, vcard:n
7  | Action                   | s:SearchAction
8  | Event                    | dv:Event, iCal:component, iCal:description, iCal:dstart,
   |                          | iCal:summary, iCal:vcalender, iCal:Vevent, s:Event, s:Place,
   |                          | vcard:fn, vcard:n, vcard:url, vcard:vcard
9  | Images                   | s:ImageGallery, s:ImageObject, vcard:photo
10 | Local Tourism Business   | s:Campground, s:GolfCourse, s:LocalBusiness, s:Place,
   |                          | s:TouristAttraction, s:TouristInformationCenter
11 | Social Media             | dc:source, og:admins, og:app_id, og:description, og:fbmladmins,
   |                          | og:image, og:site_name, og:title, og:type, og:url, s:sameAs,
   |                          | xfn:mePage, xfn:me-hyperlink
12 | Website Information      | dc:title, s:Language, s:WebPage, s:WebSite


Table 1 presents the twelve topics identified during the analysis, including the list
of data classes that make up each group. The first six topics were taken from the study
of Meusel et al. [23]. The remaining groups were defined on the basis of the examined
data of the RTOs. The ontologies are abbreviated as follows: “s:” stands for Schema.org,
“dv:” for Data Vocabulary, “dc:” for Dublin Core, and “og:” for OGP, each followed by
the respective data class.

The subdivision into these twelve topics unfortunately does not guarantee that the
content does not overlap. For example, many blog articles contained information
on tourist attractions (topic “Local Tourism Business”), pictures in the category
“Images” were occasionally identical to the image properties of individual topics such
as “Organization”, “Event”, “Local Tourism Business”, or “Blogs”, and several classes
are also described by properties that contain address information. The Schema.org class
“s:Place” was divided manually between two topics: “Event”, if the
information was about an event location, and “Local Tourism Business” otherwise.
The use of the topics is summarized in Table 2; details on the individual
topics are presented in the following.


Table 2. Use of topics by the 78 RTOs using semantic annotations. 


Topic                    | RDF quads       | RTOs
Navigational Information | 398,947 (33.2%) | 41 (52.6%)
Addresses                | 176,755 (14.7%) | 35 (44.9%)
Local Tourism Business   | 134,577 (11.2%) | 20 (25.6%)
Event                    | 94,827 (7.9%)   | 20 (25.6%)
Product Data             | 63,670 (5.3%)   | 24 (30.8%)
Website Information      | 63,130 (5.3%)   | 68 (87.2%)
Blogs                    | 52,307 (4.4%)   | 29 (37.2%)
Organization             | 24,182 (2.0%)   | 29 (37.2%)
Images                   | 22,301 (1.9%)   | 13 (16.7%)
Social Media             | 21,799 (1.8%)   | 20 (25.6%)
Action                   | 4,837 (0.4%)    | 15 (19.2%)
People                   | 1,446 (0.1%)    | 10 (12.8%)
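
The two figures reported per topic in Table 2 (number of RDF quads and number of distinct RTOs) follow from a simple grouping over the class-to-topic mapping of Table 1. The Python sketch below illustrates this under stated assumptions: the mapping is only an excerpt, the sample rows are made up, and classes that were assigned manually (such as “s:Place”) are simply skipped.

```python
from collections import defaultdict

# Excerpt of the class-to-topic mapping from Table 1 (not the full table).
TOPIC_OF = {
    "s:BreadcrumbList": "Navigational Information",
    "dv:Breadcrumb": "Navigational Information",
    "s:PostalAddress": "Addresses",
    "s:Hotel": "Product Data",
    "s:Event": "Event",
    "dv:Event": "Event",
}


def summarise(rows):
    """rows: iterable of (rto_domain, data_class) pairs, one per RDF quad.

    Returns {topic: (number of quads, number of distinct RTOs)}.
    """
    quads = defaultdict(int)
    rtos = defaultdict(set)
    for domain, data_class in rows:
        topic = TOPIC_OF.get(data_class)
        if topic is None:
            # Unmapped or manually assigned classes (e.g. s:Place) are skipped here.
            continue
        quads[topic] += 1
        rtos[topic].add(domain)
    return {topic: (quads[topic], len(rtos[topic])) for topic in quads}


# Made-up sample rows for illustration only.
sample = [
    ("wien.info", "s:Event"),
    ("lech-zuers.at", "dv:Event"),
    ("wien.info", "s:Event"),
    ("montafon.at", "s:Hotel"),
    ("wien.info", "s:Place"),
]
print(summarise(sample))  # {'Event': (3, 2), 'Product Data': (1, 1)}
```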


Navigational Information. Roughly every third annotation serves to
mark up the breadcrumb and list items that help navigate the website. Nearly 56% of
this topic is annotated using Schema.org and 44% using Data Vocabulary. Only about
0.1% of the markup is made using JSON-LD and Microdata. A total of eight RTOs
account for 81% of the data in this category, of which the RTO “zillertalarena.com” alone
accounts for 40%. The most commonly used classes are “dv:Breadcrumb”
and “s:SiteNavigationElement”.


Addresses. Almost 15% of the markup contains various address details. The annotations
use Schema.org and Microdata in 96% of the cases; the remainder is annotated using the
Microformat hCard. 41% of the RTOs annotate address data of the region where the
company or local providers are located; the exact address (either street and house
number or latitude and longitude) is provided by 45% of the RTOs. 15% of the RTOs use
this topic for specific contact information such as telephone, fax, e-mail, or URL.


Local Tourism Business. 11.2% of the RDF quads represent information on this topic.
Four RTOs (wien.info, weinviertel.at, innsbruck.info, gastein.com) contribute 84.1% of
the data in this topic. Only the Schema.org ontology and the Microdata format are used here.


Events. Almost 8% of the data represent events in the region. 98% of the annotations
are made by means of Schema.org and Microdata, the remainder by the Microformats
hCalendar and hCard. The most used property is the start date of an event, followed by
the name, image, location, URL, description, address, and special offers. Overall,
only two RTOs (wien.info and lech-zuers.at) account for 87.3% of all annotations in
this topic.


Product Data. This topic describes both the “Product” and “Offer” data classes as well
as various types of accommodation that can be considered the product of an RTO.
5.3% of all RDF quads found are subsumed under this topic. The RTOs used the
Schema.org ontology and the Microdata format. The most used annotations (over 1,000
each) include the LodgingBusiness, AggregateRating, LocationFeatureSpecification,
Offer, Hotel, Product, and Review classes. Three RTOs (wien.info, montafon.at,
kitzbuehel.com) account for a total of 91% of all semantic markup of this topic.


Website Information. This topic describes various elements such as the title,
alternative names, languages used, and individual elements of a website. 62% of the RDF
quads were annotated using Dublin Core, the rest by means of Schema.org.
Microdata is the dominant format (93%), with JSON-LD making up the remaining
7%. Although 68 RTOs use this topic, more than half of the RDF quads in this
category were annotated by wien.info.


Blogs. This topic subsumes blog, press, and web pages published on the website, including
author data, titles, descriptions, and evaluations. Four regions (best-of-zillertal.at,
wien.info, mayrhofen.at, and grossarltal.info) out of 29 account for 81% of all
RDF quads of this topic. Almost half of all annotations are made using Schema.org and
Microdata, the rest using hCard. Typical semantic information includes headline,
description, author name, and URL.


Organization. This topic is used to present information about the website operator
such as name, logo, and VAT number. 96% of the annotations are made using Schema.org,
the rest using Data Vocabulary and Microformats. Microdata is used in 69% of
annotations, followed by JSON-LD (28%) and the Microformat hCard (4%). The use of
this topic is dominated by four RTOs (nationalpark.at, oetztal.com,
stantonamarlberg.com, and neusiedlersee.com).


Images. This topic contains various pictures and collections of pictures. 99% of the
annotations use Schema.org (mainly Microdata), the rest the Microformat hCard. Four
RTOs (kaernten.at, kitzbuehel.com, montafon.at, and tennengau.com) account for 85%
of all annotated images.


Social Media. Social media annotations are made using four different ontologies
(primarily OGP and Schema.org, but also Dublin Core and XFN) in all four formats.
The most common purpose is to link to the social media presence: 10 RTOs link to
their page on Facebook, five on Instagram, four on YouTube, three on Google+, two on
Twitter, and one each on Pinterest and Flickr. Almost 70% of all annotations were
made by the RTO neusiedlersee.com.


Action. This topic is used to mark up search fields or forms, which search engines use
primarily to let users search the content of a website directly on the search results
page in a dedicated search box.
Four RTOs (grossarltal.info, austriatourism.com, reutte.com, and bregenzerwald.at) out
of 15 account for 91.5% of the markup in this topic, which is made exclusively using
JSON-LD and Schema.org.


People. This topic subsumes individuals (article authors, team members, etc.) and 
company job offers. Most annotations are based on Schema.org and Microdata. Three 
out of ten RTOs (lech-zuers.at, hoch-koenig.at and mayrhofen.at) make up 94% of all 
RDF quads in this topic. 


5 Discussion 


The analysis revealed that the use of the Semantic Web by Austrian RTOs complies with
the recommendations of leading search engines such as Google, Yahoo, Bing, and
Yandex. The majority of semantic annotations by tourism regions are made using
Microdata and JSON-LD. In addition, of the eight ontologies that are
used, the recommended Schema.org is preferred, along with its predecessor, Data
Vocabulary; together they account for over 80% of all annotations.

The grouping of semantic markup into twelve thematically related topics allowed an
overview of all structured data specifically for Austrian tourism regions, regardless of
the formats and ontologies used. The analysis showed that, with the exception of the
three general topics (“Navigational Information”, “Addresses”, and “Website
Information”), the annotation of RTO-specific tourism information is strongly influenced
by only a few RTOs. While general information is important to search engines as well
as various software agents, specific tourism content should also be semantically
annotated to exploit the full potential of the Semantic Web.

The Schema.org classes and properties relevant for tourism are distributed across
different parts of this ontology [16]. However, Austrian RTOs use only a few data types and
properties of Schema.org intended for the tourism industry. For example, no annotations
for food establishments (“FoodEstablishment” class with possible types “Bakery”,
“BarOrPub”, “Brewery”, “CafeOrCoffeeShop”, “FastFoodRestaurant”, “IceCreamShop”,
“Restaurant”, “Winery”, etc.) or ski resorts (“SportsActivityLocation”, “SkiResort”
classes) were found, although such content is available on the websites.

The analysis of the topic “Product Data” revealed that the possibility of specifying
specific types of accommodation is hardly used by the RTOs. For example, the Schema.org
type “LodgingBusiness” can be used, or the more specific subtypes “Hostel”,
“Hotel”, “Motel”, “Resort”, “Campground”, or “BedAndBreakfast”. The three types
“Hotel”, “Campground”, and “BedAndBreakfast” together with the type
“LocationFeatureSpecification” are only used by one RTO (montafon.at). Furthermore, none
of the RTOs annotate specific events such as “MusicEvent”, “SocialEvent”,
“SportsEvent”, etc. Nevertheless, a precise classification of all available content is
particularly important for tourism organizations, and generic classes should be
avoided [32].

Detailed information on accommodations that is relevant for a user’s booking
decision and also contributes to specific search results (e.g. Schema.org properties like
“amenityFeature”, “availability”, “price”, “offer”, “paymentAccepted”, “petsAllowed”,
“priceCurrency”, and “priceRange”) was used by 13 RTOs. Taking a closer
look, 92% of the RDF quads with such detailed information came from only one region
(montafon.at). The remaining twelve RTOs used the properties mentioned only
sporadically. As a result, applications need additional data extraction and fusion techniques
to understand the content of these sites (e.g. to find out which RTO offers a specific
type of accommodation with specific equipment). Thus, the integration of multiple data
items representing the same real-world object into a single, consistent, and precise
representation remains challenging [9].


6 Conclusion 


The present work empirically studies the use of structured data on the websites of
Austrian tourism regions. According to the results of this analysis, 59% of the tourism
organizations surveyed use the Semantic Web, which is a high ratio in international and
industry comparison. However, the use follows the Pareto principle:
20% of the tourism regions account for 82% of all semantic markup. Most tourism
regions adhere to the recommendations of the search engines and use the ontology
Schema.org and the formats Microdata and JSON-LD. While semantic markup of basic
information such as addresses, corporate data, and website data is necessary, many areas
that would help unlock the full potential of the Semantic Web are neglected by
Austrian RTOs. The use of tourism-relevant topics, such as regional events,
accommodations, blog posts, images, or social media, is dominated by a few RTOs. None of
the special tourism ontologies were applied, and only a few classes and properties
that are typical for this industry are used by a large number of tourism regions.
Many tourism-relevant data, such as points of interest, ski resorts, user reviews,
restaurants, job descriptions, and accommodation equipment, including dynamic content
such as prices or availability, are available on the websites but are annotated only
sporadically by RTOs. Despite the comparable contents on the RTO websites and a common
objective of achieving the highest possible online visibility and better presentation in the
search results, and thus a higher booking and attendance rate, the usage scenarios of
the Semantic Web differ among Austrian tourism regions.

The findings of this study are based on a secondary source. This implies that the
number of items of investigation was limited from the start. It has not been investigated
whether the sites selected for this analysis were included in the original 3.2 billion site
list. In addition, only websites with a maximum of four website navigation levels
were included in the original data set. The original record may also exclude websites
that prohibit the browsing of their contents by unknown web crawlers, which was
also not checked during this analysis. Furthermore, the structured data was extracted
from the dataset for November 2017 at a single point in time, making it impossible, for
example, to check some records in real time. An interesting research approach for the
future would be to repeat the same study at periodic intervals to see whether the use of
Semantic Web technology has changed over time.

Another limitation of this study is the fact that several errors in the semantic
annotations on the websites were found when preparing the source data for analysis.
Such mistakes not only complicate data analysis but may also defeat the very purpose of
structured data. Since systematic error detection was not the subject of this work, these
errors may bias the analysis results through wrong classification or incorrect detection of
semantic markup. Future research should focus more on error analysis in semantic
annotations and on how these errors could be avoided (e.g. through semantic annotation
tools).

The analysis results may have further been influenced by the non-differentiation of
language variants of a website. Thus, tourist regions with a large number of indexed
pages on search engines, representing many touristic objects in multiple languages,
show better results in this analysis. In addition, the proportion of structured data that
was used only on the subdomains of the RTO websites has not been determined. It is
thus possible that a whole tourist region shows better results, even though semantic
annotations were only made on a few subdomains. Thus, an international comparison
that copes with different languages and/or subdomains would be of interest. This would
allow identifying best practices and recommended actions specifically for the tourism
organizations in a certain country.

Even though tourism-specific semantic markup is not widely used on Austrian RTO
websites, it can be expected that with the increasing spread of intelligent web
applications and services, more and more content owners will deal with this subject. A better
visibility of the services and offers of the touristic region through semantic annotations
helps reduce dependence on international online intermediaries and
should therefore be more widespread on the websites of Austrian tourism organizations.


Abstract. An organisation using personal data should document its
data governance processes to maintain and demonstrate compliance with
the General Data Protection Regulation (GDPR). As processes evolve,
their documentation should reflect these changes with an assessment
showing ongoing compliance. Through this paper, we show how semantic
representations of processes are useful towards maintaining ongoing
GDPR compliance by using a test-driven approach that generates and
checks constraints for adherence to GDPR requirements. We first check
whether all required information has been documented, and then whether
it is compliant. We prototype our testing approach using a real-world
website’s consent mechanism for GDPR compliance, and persist results
towards generating documentation. We use previously published ontologies
to represent processes (GDPRov), consent (GConsent), and GDPR
(GDPRtEXT), with SHACL used to test requirement constraints.
Paper and Resources: https://w3id.org/GDPRep/semantic-tests.


1 Introduction


Demonstrating compliance with the General Data Protection Regulation
(GDPR) [17] requires documenting information regarding how its various
obligations and requirements were met. GDPR explicitly requires documentation of
information for records of processing activities (R82, A30), consent (R42, A7-1),
and impact assessment (DPIA (A35)). It also requires controllers to implement
and periodically review appropriate measures regarding processing (A5-1, A24).
Therefore, the process of assessing, maintaining, and demonstrating compliance
with the GDPR is tightly coupled with operational workflows involving personal
data.

Processes change and evolve over time: the purpose may change, the same
process may be used for additional purposes, or the assigned processor may
change. For GDPR compliance, each such change needs to be documented as a
temporally versioned record of processing to demonstrate compliance regarding
processing activities at that period in time. It would be considered prudent and
good practice to show that the specific change was assessed and verified to be
compliant before proceeding with it. This is mandatory under GDPR for certain
situations requiring a DPIA (A35).

Semantics, and by extension the Semantic Web, has been demonstrated to be
of assistance in the management of GDPR compliance. Existing work addresses
modelling machine-readable metadata for compliance [8,11,13,14], querying
for compliance-related information [16], and maintaining compliant processing
logs [8]. Interoperable semantics are beneficial when information is shared between
stakeholders such as controllers and processors, or controllers and certification
bodies or supervisory authorities. The interoperability is also helpful towards
transparency regarding processing activities to address the discrepancy between
requirements of an organisation and compliance [18]. A discussion of four areas
where automation can be applied [7], one of which is compliance using checklists,
shows possible avenues for further incorporating semantics into the compliance
process.

In this paper, we show how semantic representations of processes are useful
in a test-driven approach for documenting ongoing compliance with the GDPR.
We describe our approach towards generating and testing constraints based on
requirements gathered from GDPR and the use of semantics to generate
documentation linked with the GDPR. The paper also presents an application of this
approach by testing a website’s consent mechanism for GDPR compliance and
generating compliance documentation. For this, we build on our previous work
including ontologies to represent processes (GDPRov [14]), consent (GConsent
[12]), and GDPR (GDPRtEXT [13]), and an approach to turn compliance
questions into semantic queries [16]. An overview of this was presented in a prior
publication [15].


2 Approach 


2.1 Generating Constraints from Requirements 


The first step towards compliance is selecting applicable clauses from the GDPR 
and converting them into tangible requirements. Resources useful for this include 
information and guidance provided by Data Protection Authorities and profes- 
sional institutes. Information pertaining to the fulfilment of these requirements 
is required for compliance documentation. 

The next step is to identify the information required to assess whether
requirements have been met, and then generate constraints that (a) check the
presence of that information, and (b) verify its correctness. For the purposes of
this paper, we focus on the legal basis of given consent, with a subset of the
requirements and constraints presented in Table 1. Checking for presence of
information before verification of correctness follows a closed-world assumption
where absence of information indicates non-compliance.
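The two-stage check can be sketched as follows, with a plain Python dictionary standing in for the data graph. The field names (`data_subject`, `purposes`, `timestamp`) and the helper `check_consent` are hypothetical and are not taken from GConsent; the sketch only illustrates that missing information is reported as a failure rather than skipped.

```python
from datetime import datetime

# Hypothetical required fields for a given-consent record (illustrative names only).
REQUIRED_FIELDS = ("data_subject", "purposes", "timestamp")


def check_consent(record):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    # Stage (a): presence. Under a closed-world assumption, absence is a failure.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, "", []):
            failures.append(f"missing: {field}")
    if failures:
        return failures  # correctness cannot be assessed without the information
    # Stage (b): correctness of the information that is present.
    if not isinstance(record["data_subject"], str):
        failures.append("invalid: consent must be associated with exactly one data subject")
    try:
        datetime.fromisoformat(record["timestamp"])
    except (TypeError, ValueError):
        failures.append("invalid: timestamp is not an ISO 8601 date-time")
    return failures
```

Returning early after the presence stage is a design choice of this sketch: it keeps correctness checks from raising errors on absent values.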

Constraints that verify correctness, or rather conformance, to requirements 
are required to be implemented based on underlying information representa- 
tions (e.g. ontology). Some constraint assessments can be automated whereas 
others require human intervention, particularly where qualitative requirements 
Table 1. Subset of Constraints and Assumptions regarding Given Consent

GDPR              Constraint
A4-11             Consent must be associated with only one Data Subject
R32,A4-11         Consent must have one or more categories or types of personal data associated with it
R32,R42           Consent must have one or more purposes associated with it
R32,A4-11         Consent must have one or more processing associated with it
A7-3              Consent must have one and only one state/status
A7-2              Consent is given by exactly one Person
                  Given consent must have information on how it was obtained
                  Consent must have artefacts associated with how it was obtained
                  Consent must have information on what choices were provided
                  Consent must have statement or affirmative action
                  Consent must have information about right to withdraw
R32,A7-2          Consent must not have more than one medium it was provided
                  Consent must have a timestamp indicating when it was given
                  Purpose or processing associated with Third Party must specify role played by the Third Party
                  If data is being stored, it must have information on how long it will be stored for
                  Storage of data must have information on its storage location
R71,A9-2c,A22-2   Automated processing of personal data must be clearly indicated
R111,A49-1a       Data transfer to third country or international organisation must specify identity of recipient
R51,A8-2a         Personal data belonging to a special category must be clearly indicated



are involved. For example, informed consent requires the request to be clear and
unambiguous, which needs to be evaluated manually (see footnote 1).

A test for compliance contains verification of (one or more) constraints where
results indicate compliance with identified requirements. By linking the
constraint with relevant points or concepts within GDPR, it is possible to generate
and document ‘coverage’ of compliance. For example, for constraints generated
from identified requirements, by having their results linked to the GDPR, the
number of tests passed indicates compliance with the set of linked GDPR points or
articles.
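One way to picture such coverage is as a count of passed and executed tests per linked GDPR point. The following is a minimal sketch under that assumption; the function `gdpr_coverage`, the tuple layout, and the sample constraint names are illustrative and do not come from the paper's implementation.

```python
from collections import defaultdict


def gdpr_coverage(results):
    """Map each linked GDPR point to (tests passed, tests run).

    `results` is an iterable of (constraint_name, gdpr_points, passed) tuples.
    """
    coverage = defaultdict(lambda: [0, 0])
    for _name, gdpr_points, passed in results:
        for point in gdpr_points:
            coverage[point][1] += 1  # one more test run for this point
            if passed:
                coverage[point][0] += 1  # and it passed
    return {point: tuple(counts) for point, counts in coverage.items()}


# Illustrative results; the links reuse references that appear in Table 1.
results = [
    ("Consent has data subject", ["A4-11"], True),
    ("Consent has purpose", ["R32", "R42"], True),
    ("Consent has personal data", ["R32", "A4-11"], False),
]
coverage = gdpr_coverage(results)
```

A point is fully covered in this sketch when both numbers in its pair are equal.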

Constraints can be linked to each other to formulate dependency relation- 
ships. This can make testing for compliance more efficient by identifying common 
dependencies. It also allows creating logical groupings of related constraints. Such 
groupings can be based on functionality or relation to GDPR such as association 


1 While it may be possible to use NLP-based approaches to evaluate the complexity of 
language to determine whether it is clear and unambiguous, such approaches cannot 
be assumed to be universally applicable, and therefore require a manual assessment. 
with one concept or one specific article. For example, requirements for validity of
consent are grouped from individual constraints for each requirement (e.g. clear,
unambiguous), with requirements for explicit consent containing only additional
constraints along with the group for valid consent.
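The grouping of explicit consent on top of valid consent can be sketched as groups that include other groups. The group and constraint names below are hypothetical placeholders, and `expand` is an illustrative helper rather than part of any ontology mentioned here.

```python
# Hypothetical constraint groups; names are illustrative only.
GROUPS = {
    "valid_consent": {
        "constraints": ["clear", "unambiguous", "freely_given"],
        "includes": [],
    },
    "explicit_consent": {
        "constraints": ["explicit_statement"],  # only the additional constraints
        "includes": ["valid_consent"],          # dependency on the valid-consent group
    },
}


def expand(group, groups=GROUPS, _seen=None):
    """Return every constraint in `group`, following `includes` links (dependencies first)."""
    seen = set() if _seen is None else _seen
    if group in seen:  # guard against cyclic or repeated includes
        return []
    seen.add(group)
    constraints = []
    for dependency in groups[group]["includes"]:
        constraints.extend(expand(dependency, groups, seen))
    constraints.extend(groups[group]["constraints"])
    return constraints
```

Because shared groups are expanded only once, a common dependency is tested a single time even when several groups include it.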


2.2 Model of Processes 


Representing a model or template of processes as machine-readable metadata 
has advantages in terms of ex-ante verification of compliance. This allows creat- 
ing constraints that specifically check whether the model of processes follows the 
requirements gathered from GDPR. This is distinct from verification of compliance
using records or logs of processing, which constitutes ex-post compliance.
For example, verifying whether the consent collection mechanism follows require- 
ments for valid consent is done by representing the mechanism as a model and 
checking constraints associated with validity of given consent. 

The model also allows testing for existence of internal processes regarding
handling of data subject rights and data breaches. The metadata representation
of the model enables creating a persistent snapshot of processes for planning,
conducting an impact assessment (DPIA), and inspecting past compliance.
Additionally, creating and testing a model allows abstraction of information
common to instances, such as the notice or dialogue for consent, which is common
to all or a significant number of data subjects. By abstracting such common
information into the model of the process, actual instances of given consent need
to be linked only with the relevant attributes and can refer to the model for more
information regarding compliance.

Using models also makes the testing process more efficient in terms of reduc- 
ing the number of tests to be conducted. If a model is verified to be compliant 
using prior testing, then its instances can be verified to be compliant using only 
the constraints specific to the instance. For example, when verifying compliance 
for processing using given consent as a legal basis, the validity of given consent 
also needs to be evaluated. By abstracting the model of collecting consent and 
verifying it to be compliant, the given consent used in processing is assumed to 
be valid. The only constraint that needs to be tested is therefore whether the 
processing is permitted based on the interpretation of given consent. 
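The saving described above can be sketched as follows: the model is verified once, and each instance is then checked only against its instance-specific constraints. All names here (`verify_instance`, `processing_permitted`, `ConsentDialogueModel`, the dictionary keys) are assumptions made for illustration.

```python
def verify_instance(instance, verified_models, instance_checks):
    """Verify an instance by running only instance-specific checks.

    The instance must refer to a model whose compliance was established earlier;
    otherwise it fails outright and the model has to be tested first.
    """
    if instance["model"] not in verified_models:
        return False
    return all(check(instance) for check in instance_checks)


def processing_permitted(instance):
    # Instance-specific constraint: the processing must fall within the given consent.
    return instance["processing"] in instance["consented_processing"]


# Hypothetical outcome of prior testing of the consent collection model.
verified_models = {"ConsentDialogueModel"}
```

In this sketch the validity of the given consent is not re-tested per instance; it is inherited from the verified model, which mirrors the assumption stated in the text.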


2.3 Testing and Documentation 


The requirements and constraints by themselves are universal in that they can be 
expressed without dependence on any technology or information representation. 
Adapting constraints into a testing framework requires basing it on the
underlying models and information representations. For example, where information
is defined using RDF+OWL, the testing framework is created using relevant
technologies that can query and validate RDF+OWL, such as SPARQL [19]
and SHACL [9] respectively. In this case, the information format (RDF) itself
enables the use of semantics which assists in linking the information, constraints,
and results with points of relevance within the GDPR. Where the underlying
information format does not inherently support semantics, these can be added
as metadata to the test results to link them with GDPR.

Having the information or metadata format be machine-readable and inter- 
operable allows taking advantage of querying and validation. The testing frame- 
work needs to be aware of the vocabularies and technologies used to represent the 
information and should persist results using machine-readable metadata. Tests 
should be defined at a granular level to enable actionable constraints such as 
“personal data (category) should have a source”. These are then combined to 
create larger and more complex tests, which is similar to the creation of ‘unit’ 
tests and combining them into modules to test complex functionality. For
example, testing whether personal data collected from users and shared with a third
party with legal basis of consent adheres to given consent requires verification
using constraints that test (a) the source of personal data (user), (b) third party
identity, (c) legal basis, and (d) matching of processing with given consent.
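The combination of granular tests into a larger one can be sketched in the style of unit tests. The four checks below correspond to points (a) to (d); their names and the dictionary keys are hypothetical and chosen only for this illustration.

```python
def has_source(activity):
    return activity.get("source") == "user"


def has_third_party_identity(activity):
    return bool(activity.get("third_party"))


def has_consent_basis(activity):
    return activity.get("legal_basis") == "consent"


def matches_given_consent(activity):
    return activity.get("processing") in activity.get("consented_processing", ())


def compose(*unit_tests):
    """Combine granular tests into one test that reports the names of failing units."""
    def composite(activity):
        return [test.__name__ for test in unit_tests if not test(activity)]
    return composite


# Composite test for sharing user data with a third party under consent.
sharing_with_third_party = compose(
    has_source, has_third_party_identity, has_consent_basis, matches_given_consent
)
```

Reporting the failing unit names, rather than a single boolean, keeps the result actionable at the granular level.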

The results of tests are associated with articles or concepts within GDPR
based on the requirements used to generate constraints. Depending on the extent
of machine-readable information used, it is possible to also include information
such as (a) representation of processes, (b) testing constraints, (c) results of
internal evaluations, and (d) text of GDPR. The end result of the testing process
is a report that lists compliance with GDPR in the form of requirements
(un-)fulfilled.


3 Demonstration Using Use-Case 


3.1 Creating the Data Graph 


For the use-case, we chose the consent mechanism on the quantcast.com website,
depicted in Fig. 1, and modelled the data graph based on information presented
in the consent dialogue and the website. The choice of website was made based
on Quantcast being a provider of a GDPR consent collection mechanism using the
IAB consent framework (see footnote 2). The website was also one of the few (to
the authors’ knowledge) that allows changing/withdrawing consent using the same
dialogue. We chose to include information from the website about analytics
services provided by Quantcast as it uses personal data. More information on the
creation of the data graph is available online (see footnote 3).

We used GDPRov (see footnote 4), which extends PROV-O [10] and P-Plan [3], to
model personal data and consent workflows, and GConsent (see footnote 5) to model
consent attributes and given consent. GDPRov allowed representing processes and
personal data mentioned in the consent dialogue as models. GConsent allowed
expressing consent using attributes such as medium and status. Where there was an
overlap, such as for personal data and purpose, we used both to define the instance.


2 IAB Transparency and Consent Framework https://advertisingconsent.eu/.
3 Paper and Resources https://w3id.org/GDPRep/semantic-tests.
4 GDPRov Ontology https://w3id.org/GDPRov.
5 GConsent Ontology https://w3id.org/GConsent.
[Figure 1: screenshots of the Quantcast consent dialogue]

Fig. 1. Consent dialogues on quantcast.com (clockwise from top-left) (a) first screen
(b) default options on selecting “I Accept” (c) default options on selecting “Show
Purposes” (d) Third parties listed for purpose “Personalisation”


We collected personal data categories from the descriptions in the consent 
dialogue as well as other pages on the website describing various products and 
services offered by Quantcast. We defined the source of personal data as ‘user’ 
where data collection was mentioned in the consent dialogue, and ‘third party’ 
where explicitly defined. We defined processes for addressing the rights provided 
by GDPR using descriptions provided in the privacy policy. Where a URL or 
email address was provided regarding rights, we defined it as the IRI of the 
process for handling that right. We defined the IRI for DPO using the contact 
point provided in the policy. 

We represented the consent collection mechanism on the website as an 
instance of gdprov:ConsentAcquisitionStep. This was defined as a step in the 
process QChoice representing the product Quantcast Choice. Similar processes 
were defined for Marketing, Advertisement, and Measurement identified from 
the information on the website. Each top-level description in the consent
dialogue, e.g. Personalisation, was modelled as gdprov:Purpose and gc:Purpose with
processing and personal data modelled from its description. The legal basis was
defined using GDPRtEXT (see footnote 6) and was associated at the process
(purpose) or step (processing) level. We used given consent as the legal basis for
purposes mentioned in the consent dialogue and legitimate interest otherwise.

In the consent dialogue, the use of independent radio buttons was interpreted as
allowing the user to give and withdraw consent for each individual purpose, which
was represented by creating separate instances of consent for each choice. We
modelled the dialogue as an instance of gdprov:ConsentAgreementTemplateBundle
consisting of several gdprov:ConsentAgreementTemplate instances to represent
multiple individual consent entities. We had difficulty in interpreting the language
used for third parties as it suggests the user is giving consent directly to third
parties rather than to Quantcast. Pending clarification from legal experts, we chose
to represent these as data recipients rather than as Controllers or Joint-Controllers
for ease of testing. This allowed us to represent the data sharing processes in a
concise manner with each purpose being associated with the hundreds of third
parties listed in the consent dialogue rather than defining a separate consent
representation for every third party. For testing, we defined an instance of given
consent (see Fig. 2) which was then later withdrawn. All resources associated with
the data, constraints, and queries are available online (see footnote 3).
Fig. 2. Visualisation of Given Consent in the data graph (using GraphDB)


6 GDPRtEXT Ontology and Resource https://w3id.org/GDPRtEXT.
# Node shape grouping the manually checked constraints on consent withdrawal
:WithdrawConsentConstraints a sh:NodeShape ;
    sh:targetClass m:ManualTest ;
    sh:property :WithdrawConsentEase ;
    sh:property :WithdrawConsentInformation ;
    rdfs:label "Withdraw Consent Constraints" .

# Withdrawal must be as easy as giving consent (linked to Article 7(3))
:WithdrawConsentEase a sh:PropertyShape, :ManuallyCheckedConstraint ;
    :linkToGDPR gdpr:article7-3 ;
    sh:name "Ease of Withdraw Consent" ;
    sh:path m:withdrawingConsentIsAsEasyAsGivingConsent ;
    sh:hasValue true ;
    sh:message "(M) Consent should be as easy to withdraw as it is to give" .

# Information about withdrawal must be provided before consent is given
:WithdrawConsentInformation
    a sh:PropertyShape, :ManuallyCheckedConstraint ;
    :linkToGDPR gdpr:article7-3 ;
    sh:name "Withdraw Consent Information" ;
    sh:path m:withdrawingConsentInformationBeforeGivingConsent ;
    sh:hasValue true ;
    sh:message "(M) Information about withdrawal should be provided before giving consent" .


Listing 1.1. SHACL constraints for manual tests regarding consent withdrawal


3.2 Testing Data Graph for Compliance 


We defined constraints over the data graph using SHACL and its extension
SHACL-SPARQL [9]. For testing, we used the SHACL validator binary provided
by TopBraid (see footnote 7). To distinguish between constraints that could be
verified automatically and those that required manual consideration, we subclassed
sh:NodeShape as AutomaticallyCheckedConstraint and ManuallyCheckedConstraint,
where manual tests checked the value of boolean properties. For example,
the value of consentIsBySilence indicates whether consent is given by silence,
with the valid value being xsd:false. The consent collection dialogue was considered
as the input for manual tests regarding validity of consent. Appropriate result
messages were associated with each constraint using sh:message. The property
linkToGDPR was defined to link constraints with GDPR using GDPRtEXT.
An example constraint is provided in Listing 1.1.
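Outside of SHACL, a manual test of this kind reduces to comparing recorded boolean answers with expected values. The sketch below assumes a simple dictionary of answers; `consentIsBySilence` is named in the text and the second property appears in Listing 1.1, while `EXPECTED` and `check_manual_test` are hypothetical helpers.

```python
# Expected values for manually assessed boolean properties (illustrative subset).
EXPECTED = {
    "consentIsBySilence": False,
    "withdrawingConsentIsAsEasyAsGivingConsent": True,
}


def check_manual_test(answers):
    """Compare an assessor's recorded answers with the expected values.

    A missing answer counts as a failure, mirroring a closed-world reading.
    """
    return {
        prop: answers.get(prop) is expected
        for prop, expected in EXPECTED.items()
    }
```

Each entry in the returned dictionary is true when the corresponding manual constraint passes.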

For evaluation, we defined two sets of constraints following the outline pro- 
vided in the approach described in Sect. 2. The first set validated instances of 
given consent against defined constraints, whereas the second set first validated 
the model of consent and then validated the instances of given consent using 
the validated model. For the second set, results from validating the model were
persisted in the data graph in order to use them as input to validate given consent.
A simple bash script was used to construct a pipeline that executed constraints
and stored results as an RDF/Turtle file.

For ease of evaluation, we generated a combined data graph consisting of data
from Quantcast and the ontologies used (GDPRov, GConsent, GDPRtEXT). We
added this data graph along with the results of SHACL validation to a triple-store
(GraphDB Free Edition, see footnote 8) under separate graphs. We then executed
SPARQL queries to query the data graph and generate reports.


7 TopBraid SHACL https://github.com/TopQuadrant/shacl/. 
8 GraphDB Triple-Store http://graphdb.ontotext.com/. 
We used three separate queries to facilitate different actions associated with
compliance. The first query listed the distinct messages from failing tests as
actionable items. The second query listed the compliance of applicable GDPR
articles using links from constraints and their verification. The third query, shown
in Listing 1.2, generated a test report, depicted in Table 2, containing the
constraint description, type (automatic (A) or manual (M)), link to GDPR, result
(pass (P) or fail (F)), node (instance in data graph), and failure message (not
shown in table). The results from these queries were then used to generate a
compliance report to document the state of maintaining compliance and actions
required. The report contains results of queries related to compliance [16]. The
documentation regarding creating the data graph, constraints, and testing, along
with the SPARQL queries and generated report, is available online (see
footnote 3). An overview of the testing process is shown in Fig. 3.
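The logic of the report can also be sketched independently of SPARQL. The sketch assumes, as in SHACL validation reports, that a result is produced only when a constraint is violated, so a constraint without results is treated as passing. The record layout and the functions `build_report` and `action_items` are illustrative and not part of the published resources.

```python
def build_report(constraints, validation_results):
    """Return report rows (name, type, gdpr, result, node, message), sorted by name.

    `constraints` maps a constraint id to a dict with `name`, `automatic`, `gdpr`.
    `validation_results` is a list of dicts with `constraint`, `node`, `message`.
    """
    rows = []
    for cid, info in constraints.items():
        kind = "A" if info["automatic"] else "M"
        failures = [r for r in validation_results if r["constraint"] == cid]
        if not failures:
            rows.append((info["name"], kind, info["gdpr"], "P", None, None))
        for failure in failures:
            rows.append(
                (info["name"], kind, info["gdpr"], "F", failure["node"], failure["message"])
            )
    return sorted(rows, key=lambda row: row[0])


def action_items(rows):
    """Distinct messages from failing tests, usable as a list of actionable items."""
    return sorted({row[5] for row in rows if row[3] == "F"})
```

A failing constraint yields one row per affected node, which matches the repeated rows in Table 2.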
Fig. 3. Overview of testing process


4 Related Work 


The approach presented in this paper acts on machine-readable metadata repre- 
sentation of processes and workflows associated with personal data and consent. 
An alternative to this is an approach that uses ODRL policies [5] for assessment 
of compliance using questions constructed from GDPR [1]. The ODRL policy 
consists of constraints classified as Feature, Discretional, and Dispensation with 
Rule used to specify them as Permission, Prohibition, or Duty. The policies are 
linked to the relevant text in GDPR using RDF properties similar to the use of 
GDPRtEXT in this paper. The questions are used in a tool that incorporates 
Table 2. Report showing constraints, validation results, and link to GDPR

Name                                  Type  GDPR          Result  Node
Consent ≠ Inactivity                  M     R32           P
Consent ≠ Pre-ticked Boxes            M     R32           P
Consent ≠ Silence                     M     R32           P
Consent - Data Subject                A     A4-11         P
Consent - Given To                    A                   P
Consent - Location                    A                   P
Consent - Medium                      A     A7-2          P
Consent - Personal Data               A     A4-11,R32     P
Consent - Processing                  A     A4-11,R32     P
Consent - Provided By                 A     A7-2          P
Consent - Purpose                     A     R32,R42       P
Consent - Status                      A                   P
Consent - Timestamp                   A                   F       Q:Consent20190415120753
Consent - Timestamp                   A                   F       Q:Consent20190415140000
Consent = Choice                      M                   P
Consent = Freely Given                M     A4-11         P
Consent = Specific                    M     A4-11         P
Consent = Statement of Clear Action   M     A4-11         P
Consent = Unambiguous                 M     A4-11         P
Consent Generating Activity           A                   P
Consent Request = Clear               M     R32           P
Consent Request = Concise             M     R32           P
Consent Request = Not Disruptive      M     R32           P
Consent Template                      A                   P
Ease of Withdraw Consent              M     A7-3          P
Many Processing x One Purpose         A     R32           P
One Processing x Many Purposes        A     R32           F       Q:Consent20190415120753
One Processing x Many Purposes        A     R32           F       Q:Consent20190415140000
Personal Data - Storage Period        A     A13-2-a       F       Q:CATQInfoStorageAccess
Personal Data - Storage Period        A     A13-2-a       F       Q:CATTPInfoStorageAccess
Personal Data - Storage Period        A     A13-2-a,R39   F       Q:Consent20190415120753
Personal Data - Storage Period        A     A13-2-a,R39   F       Q:Consent20190415140000
Right to Withdraw                     A     A7-3          P
Separation of Processing              M     R43           P
Third Party Categories                A     A44           P
Third Party Identities                A     A13-1-e       P
Third Party Identities                A     A30-1-d       P
Third Party Identities                A     A44           P
Third Party Safeguards                A                   P
Withdraw Consent Information          M     A7-3          P

PREFIX c: <http://example.com/Quantcast/shapes#>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?name ?test ?gdpr ?result ?node ?msg
WHERE {
    ?x a c:Constraint .
    ?x sh:name ?name .
    # Type of test: automatic or manual
    BIND (
        IF(EXISTS{?x a c:AutomaticallyCheckedConstraint},
            "Automatic"^^xsd:string, "Manual"^^xsd:string)
        as ?test)
    OPTIONAL { ?x c:linkToGDPR ?gdpr }
    # A constraint fails if any validation result refers to it
    BIND (
        IF(EXISTS{?y sh:sourceConstraint ?x},
            "FAIL"^^xsd:string, "PASS"^^xsd:string)
        as ?result)
    # For failures, report the offending node and the message
    OPTIONAL {
        FILTER EXISTS { ?y sh:sourceConstraint ?x } .
        ?y sh:focusNode ?node .
        ?y sh:resultMessage ?msg . }
} ORDER BY ?name


Listing 1.2. SPARQL query for report listing validation results linked with GDPR


human feedback and generates an assessment report. This is useful to incorpo- 
rate the manual testing requirements from our approach, as well as to present 
the results from validation as a feedback process. 

The Scalable Policy-aware Linked Data Architecture For Privacy, Transparency
and Compliance (SPECIAL) is a European H2020 project that provides
a semantic-web framework for the generation of logs that enable ex-post
GDPR compliance verification [8]. Their compliance engine can also be used to 
perform ex-ante compliance checks [2] using a model-based approach similar to 
the one advocated by GDPRov. The compliance assessment in SPECIAL focuses 
on determining whether the specified use of purposes, processes, and personal 
data is allowed by the specified legal basis such as consent. This can be incor- 
porated in our approach to determine the validity of constraints related to use 
of given consent for data processing operations. 

Other related work includes PrOnto [11], a legal ontology of concepts related
to privacy agents, personal data types, processing operations, rights and
obligations. Based on the examples shown in its associated publications, PrOnto can
be used to define the underlying data graph and the constraints for compliance
validation. The W3C Community Group for Data Protection Vocabularies and
Controls (DPVCG, see footnote 9) is currently working on taxonomies for purposes,
data processing, consent, personal data, technical and organisational measures, and
legal basis, which will provide a vocabulary for the representation and
documentation of such processes. Layered Privacy Language (LPL) [4] can be used to
model privacy properties such as personal privacy, user consent, data provenance,
and retention management for the GDPR, and can be used to define the constraints
using its authorisation-based modelling.


9 DPVCG https://www.w3.org/community/dpvcg/.
5 Discussion


In this section, we provide a broad discussion of how our test-driven approach 
can be used as a practical tool by stakeholders and the challenges in its adoption 
for real-world cases. Considering that processes and activities in an organisation
are traditionally documented without semantics, it could be tedious and
cumbersome to adopt the semantic-web based framework described in this paper.
However, as mentioned earlier, the test-based approach can also be used with 
existing representations by adding semantics to the test results and reports to 
link them with relevant information such as the articles in GDPR. This is also 
applicable towards persisting outputs of reports generated from tools [1] and 
conformity assessments (CAP) [6]. 

The advantages of representing processes with semantics go beyond testing
for compliance, as representations of processes are also useful for planning of
operations and internal documentation. Semantic representations of processes
can assist in automating the generation of documentation such as privacy poli- 
cies where processes are listed along with their purpose, legal basis, and use of 
personal data. Privacy policy generators that generate boilerplate policies exist 
online, but do not incorporate semantics. The use of semantics allows queryable 
machine-readable metadata that can be used in tools towards understanding and 
evaluating complex policies for users and authorities. 

The modeling of third parties as data recipients in Sect. 3.1 shows the chal- 
lenges in representing complexities when it comes to GDPR compliance. A report 
of cases regarding data protection [20] further shows instances where individual 
use-cases differ significantly, which could indicate that a universal ontology to
represent such processes may not be feasible. A more practical approach could 
be to create taxonomies and use them in ontology design patterns for compli- 
ance. The DPVCG taxonomies could be used alongside existing ontologies to 
create compliance design patterns to address GDPR requirements. This follows 
open technological solutions such as the SPECIAL project that drive adoption 
of semantics in the regulatory compliance space. 


6 Conclusion 


This paper demonstrates the benefits of using a test-driven approach towards 
maintaining ongoing GDPR compliance by using semantic representations of 
processes. The approach generates and checks constraints for adherence to 
GDPR requirements and persists the results towards compliance documenta- 
tion. The prototype demonstration provides an example of testing using a real- 
world website’s consent mechanism using previously-published ontologies to rep- 
resent processes (GDPRov), consent (GConsent), and GDPR (GDPRtEXT), 
with SHACL used to test requirement constraints. 

In conclusion, the generation of compliance reports by incorporating seman- 
tics into the testing process is useful to maintain and document the state of 
compliance at a given time as well as to demonstrate the ongoing compliance for 
changes to the data processes within an organisation. While the demonstration
in this paper only covers a small set of requirements for GDPR, namely those 
associated with given consent, it is sufficient to demonstrate the value of the 
approach and the use of semantics for compliance. 
Abstract. Philology and hermeneutics, as the analysis and interpretation
of natural language text in written historical sources, are the predecessors
of modern content analysis and date back to antiquity. In empirical social
sciences, especially in sociology, content analysis provides valuable insights
into social structures and cultural norms of the present and past. With the
ever growing amount of text on the web to analyze, numerous computer-assisted
text analysis techniques and tools have been developed in sociological
research. However, existing methods often lack sufficient standardization. As
a consequence, sociological text analysis lacks transparency, reproducibility
and data re-usability.

The goal of this paper is to show how Linked Data principles and
Entity Linking techniques can be used to structure, publish and analyze
natural language text for sociological research to tackle these shortcomings.
This is demonstrated on the use case of constitutional text documents of
the Netherlands from 1884 to 2016, which represent an important contribution
to the European cultural heritage. Finally, the generated data is
made available and re-usable as Linked Data not only for sociologists, but
also for all other researchers in the digital humanities domain interested
in the development of constitutions in the Netherlands.


Keywords: Cultural heritage · Sociology · NLP · Linked Data · DBpedia


1 Introduction 


Since the earliest existence of writing, text served as a means of human to 
human communication and is firmly established in human cultures [39]. The 
development of the Web and (for instance) the establishment of optical char- 
acter recognition (OCR) and automated speech recognition (ASR) technologies 
increased the amount and diversity of natural language text available to humans 
and machines. Cultural heritage often is manifested in text and by now, numer- 
ous means to explore cultural heritage exist to make the data accessible and 
explorable to a broad audience, including interactive visualizations and recom- 
mendation systems. However, in order to understand cultural heritage scien- 
tifically, fields like digital humanities and social science exist. For sociologists, 
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this unthinkable expanse of information captured in the form of text provides 
an important entry to social realty [27]. Sociological content analysis therefore 
also represents a necessary gateway to understanding cultural heritage and the 
social reality cultural heritage data captures. The mentioned cultural heritage 
exploration tools created for a broad audience however, are often not sufficient 
for sociologists to perform a scientific content analysis with. Instead, tools are 
needed to process, store, model, annotate (code) and analyze the data in order 
to develop new theories or test existing theories. With the increasing amount 
of text to be analyzed, also more technologies have been created to fulfill these 
tasks. In sociology, computer-assisted content analysis started out with (from
today’s perspective) simple frequency and valence analyses during the 1950s [35]
and grew into more sophisticated statistical Natural Language Processing (NLP)
approaches, which became accurate and efficient enough to help
uncover linguistic structures as well as semantic associations [11].
By now, a broad range of interesting and promising methods of computer-assisted
data acquisition and analysis has been established. However, [29] criticizes that, especially
in social scientific research, no standardized and systematic means for the
analysis of complex text material has emerged. [27] emphasizes the necessity to 
establish universal standards for a sustainable computer assisted text mining in 
sociology. Another problem in sociology concerns data sharing, which is to this
day largely unstandardized and often not practiced at all [17,44]. According
to [4], this lack of transparency lowers the integrity and interpretability of the
performed research and its results. Another widely discussed issue in sociology
is the re-use of research data, especially qualitative data [30]. A study by [6]
suggests that sociologists generally welcome re-using research data,
but certain obstacles, including the difficulty of finding and accessing these
data, often prevent them from doing so.

The Semantic Web provides “a common framework for the liberation of
data” [1] by giving data an independent existence [13]. As the Linked Open
Data Cloud¹ visualizes, numerous domains have not only firmly established
methods to utilize the possibilities provided by Linked Data, but have
also found ways to take part in its development, providing new applications
based on the general idea. However, in the field of sociological content analysis,
Linked Data has so far not played an important role despite the promising
standards and principles it entails.

The goal of this paper is to leverage Linked Data and its principles for 
computer-assisted sociological content analysis. Furthermore, it is demonstrated 
how this field of research can benefit from the mentioned data liberation process. 
Thereby, open research problems in both the Linked Data and social science
communities are discussed which, if solved, may improve the process of content
analysis in the future. A lesson learned here is that in order to better understand 
cultural heritage data and its meaning for the society it originated in, the Linked 
Data research community is challenged to support sociologists in improving their 
research process to be more transparent, reproducible and re-usable. 


1 https://lod-cloud.net/, last accessed: May 12, 2019.

This paper demonstrates and discusses intersection points between Linked
Data and content analysis in sociology on the foundation of the use case of 
constitutional text documents of the Netherlands from 1884 to 2016. The use case 
is generalizable and integrates Linked Data in sociological text analysis on a real 
world research example and thereby utilizes and discusses knowledge engineering, 
Named Entity Linking (NEL), and querying. Building on the previous work 
achieved in [41], the presented paper takes the Linked Data perspective instead 
of the sole sociological view. 

This paper is structured as follows. Section 2 presents relevant previous
work on the intersection between social science and Linked Data.
Section 3 presents the use case of constitutional text documents and, on this
foundation, sociological content analysis techniques in combination with Linked
Data technologies are discussed in Sect. 4. Section 5 closes this paper.


2 Related Work 


To the best of our knowledge, no previous work discusses the intersection
between content analysis in sociology and Linked Data in the depth
presented here. This work was mostly motivated by [13], whose author pointed
out the possibilities and necessity of Semantic Web technologies in the sociological
analysis process. [2] defines annotation requirements to be implemented
in cultural heritage annotation projects. The results are based on case studies 
at the National Library of Latvia. While the results are insightful, they do not 
completely apply to the process of content analysis in sociology. [11] emphasizes 
the foundations and applications of text mining in sociology, however, without 
discussing Linked Data applications. The use case chosen to reveal intersection points
between sociologists and the Linked Data community involves Dutch constitution
documents. These documents were converted from their original XML
format into RDF. The Constitute Project² as presented by [10] aimed at creating
a platform for professionals who draft constitutions and thus need to
read and compare constitutions of various countries. The main
differences to the work presented here are (1) that the data is modeled not for
constitution drafting but for a scientific content analysis and (2) that the documents
by [10] represent only the latest version of a constitution and not all historical editions,
as is the case in this paper.


3 Use Case 


To assess the feasibility and benefits of modeling, storing, annotating and querying
documents for sociological content analysis based on Linked Data, a generalizable
research example of constitutional documents was chosen. The original
document corpus was created by [23]. It consists of 20 XML documents, each containing
one version of the Dutch constitution from 1884 to 2016 in German.


2 https://www.constituteproject.org/ontology/, last visited: May 12, 2019.

The previous work achieved by the authors is as important as it was cumbersome,
since no machine-readable and chronological dataset of European constitutions
is publicly available on the Web. Even though an HTML representation of these
constitutions in German exists on the Web³, the information about which
changes appeared in which constitution edition is presented in an unstructured
way. In sociological research, constitution texts make it possible to learn about state
identities, definitions of affiliations (e.g. citizens, foreigners, heads of state) and their
change over time [3,12,28]. Constitutions can be viewed as a mirror of society
and as a self-description of the state in the context of global societies [14] and 
therefore represent an important contribution to cultural heritage. Sociological 
research questions involving constitution texts include the modeling of the rela- 
tionship between the state and the citizen [23], the modeling of gender in a 
state [16,20] and religious freedom [26,40]. Constitutions follow a strict struc- 
ture and hierarchy. Each document is divided into several main chapters which 
are furthermore divided into paragraphs, articles and sections. As is often required
in sociological content analysis, studying constitutional documents means
examining their structure, their content as well as their changes over time. Even
though this use case covers only one domain of research for sociologists, it poses 
versatile research problems and is generalizable to a broad range of cultural 
heritage texts used for content analysis in sociology. 


4 Linked Data Enabled Content Analysis for Sociology 


In sociological research, data sharing and publishing is neither standardized nor 
is it widely practiced. Studies by [44] and [17] show that social science journals
have only slowly started to adopt data sharing policies and that most journals
which enforce data publishing policies do so in an incomplete and inconsistent
way. The problem becomes clearer when looking at the research process
itself. In sociology, content analysis is generally performed in a process in which 
data is pre-processed (this can involve digitizing content as well as transforming 
the data into the needed format for analysis), followed by a coding process (i.e. 
categorizing the data in varying depth) and an analysis of the produced data to 
establish first hypotheses or test theories. However, as mentioned in Sect. 1, this 
process lacks standardized methods and reproducibility which jeopardizes the 
integrity of research results. This section addresses all three steps in this research 
process and shows how Linked Data can help to improve its reproducibility and 
transparency based on the use case scenario described in Sect. 3. Moreover, a 
number of insufficiencies are discussed which pose interesting long-term research 
questions for interdisciplinary research. 


4.1 Modeling and Publishing Documents 


The corpus introduced in Sect. 3 was originally created and made available as
XML by [23]. While XML provides a number of benefits regarding the way data


3 http://www.verfassungen.eu/, last accessed: May 12, 2019.

can be encoded syntactically, the format also has many disadvantages in contrast
to RDF, especially in terms of re-usability, data extension and linking to external 
resources [8]. [10] specifically point out the benefits of publishing constitutional 
documents as RDF rather than XML. The data from the presented use case 
were converted to Linked Data according to the best practices specified by the 
W3C [19]. The Constitute Project already developed an ontology for this domain 
of constitution documents, which was reused and adapted. The ontology treats 
all parts of a constitution in the same way, regardless of the structural element they represent
(e.g. article, section, paragraph). However, the information whether a piece of
text belongs to a specific paragraph or a chapter is needed for querying in the 
context of a sociological analysis, therefore the ontology was adapted accordingly. 
Furthermore, the ontology by [10] models the year the respective constitution was 
created in, but often several constitution versions are created in the same year. 
The ontology was adapted accordingly for the presented use case. [23] created the 
constitutional XML documents which were utilized in the presented work. As a 
contribution for this paper, the data were modeled and published as Linked Open 
Data. As a result, anyone is now able to re-use the data, query the data using the 
standardized SPARQL query language as opposed to proprietary XML parsers, 
and to reference each single semantic unit of a document separately. An example 
snippet of the generated RDF data is depicted in Fig. 1. All generated RDF data
are made available on GitHub⁴. For sociologists who want to model and publish their data
as Linked Data so that content analysis becomes more reproducible and re-usable,
this process seems straightforward. Several tools exist for finding existing vocabularies for
re-use, including Linked Open Vocabularies⁵ and Prefix.cc⁶.
Furthermore, there are a number of tools and guides to support researchers in 
the development and reuse of ontologies, e.g. [18,32,37]. 
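
The modeling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the conversion code used for the published dataset: it assumes the namespaces and the property names co:parent, co:rowType, co:header and co:text as they appear in Fig. 1, and the helper names (iri, literal, unit_triples, to_ntriples) are hypothetical. It uses only the standard library and emits N-Triples so that every structural unit becomes a separately addressable resource.

```python
# Namespaces as shown in Fig. 1 (assumed to be the ones used in the dataset).
CO = "http://www.constituteproject.org/ontology/"
S = "https://github.com/tabeatietz/semsoc/"


def iri(namespace, local_name):
    """Wrap a namespace plus local name as an N-Triples IRI term."""
    return f"<{namespace}{local_name}>"


def literal(value):
    """Serialize a plain string literal, escaping backslashes, quotes and line breaks."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def unit_triples(unit_id, parent_id, row_type, header=None, text=None):
    """Describe one structural unit (hypothetical helper, not from the paper)."""
    subject = iri(S, unit_id)
    triples = [
        (subject, iri(CO, "parent"), iri(S, parent_id)),
        (subject, iri(CO, "rowType"), iri(S, row_type)),
    ]
    if header is not None:
        triples.append((subject, iri(CO, "header"), literal(header)))
    if text is not None:
        triples.append((subject, iri(CO, "text"), literal(text)))
    return triples


def to_ntriples(triples):
    """Join (subject, predicate, object) terms into N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


EDITION = "constitution_24_n_2016-constitution"
article = unit_triples(f"{EDITION}_t2_s2_a1", f"{EDITION}_t2_s2",
                       "Article", header="Art. 42")
section = unit_triples(f"{EDITION}_t2_s2_a1_s1", f"{EDITION}_t2_s2_a1", "Section",
                       text="(1) Die Regierung besteht aus dem König und den Ministern")
document = to_ntriples(article + section)
```

Because each unit carries its own IRI and an explicit row type, a query can later distinguish an article from a section, which is the adaptation of the ontology motivated above.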


4.2 Semantic Annotation 


As mentioned above, a major part of the analysis of textual content in sociology 
is referred to as coding. This process means to categorize texts for analysis in 
order to develop new theories or test existing ones. One issue in this process is 
that closed-source tools are often used which store the resulting data in proprietary
formats (e.g. MAXQDA⁷ or ATLAS.ti⁸). If neither the textual mentions
the code is referring to nor the terms or categories used for coding (and their 
relationships) are made available immediately together with the concluding text 
drawn from the analysis, the research is not reproducible. One solution is to 
implement semantic annotation which makes use of ontologies, which explicitly 
structure knowledge and define relationships between concepts and individuals. 
Using the described use case as an example, this section demonstrates and discusses
semantic annotation for the content analysis process in sociology.


4 https://github.com/tabeatietz/semsoc, last accessed: May 12, 2019.
5 https://lov.linkeddata.es/dataset/lov/, last visited: May 12, 2019.
6 http://prefix.cc/, last visited: May 12, 2019.
7 https://www.maxqda.com/, last accessed: May 12, 2019.

8 https://atlasti.com/, last accessed: May 12, 2019.


[Figure 1: graph drawing; only its labels survive text extraction.
Prefixes: co: <http://www.constituteproject.org/ontology/>, s: <https://github.com/tabeatietz/semsoc/>.
Nodes: s:constitution_24_n_2016-constitution and its nested units ..._t2, ..._t2_s2, ..._t2_s2_a1, ..._t2_s2_a1_s1; types s:Paragraph, s:Article, s:Section, co:Section.
Properties: co:hasName, s:edition, co:isCreatedIn, co:isConstitutionOf, co:isSectionOf, co:parent, co:sectionID, co:rowType, co:header, co:title, co:text.
Literals: "Verfassung des Königreichs der Niederlande", "Hauptstück 2 - Regierung", "Art. 42", "235", "236", "(1) Die Regierung besteht aus dem König und den Ministern".]

Fig. 1. Visualization of a subset of the generated RDF graph 


Annotation System. Manual or semi-automatic annotation of text with enti- 
ties from a large knowledge base like DBpedia requires an efficient user interface. 
The task of the user interface is to suggest possible entity candidates to the anno- 
tating user based on an input text. One of the major challenges is to present 
the entities in a way that users unfamiliar with Linked Data (lay-users) are able 
to make use of the interfaces. Lay-users typically have no further insight about 
what the content of a knowledge base is or how it is structured, which has to 
be considered when suggesting the entities the user should choose from [38].


Fig. 2. refer Modal annotation interface


Some entity mentions yield lists of thousands of candidates which a human
cannot survey quickly to find the correct one. Therefore, autosuggestion utilities
are applied to rank and organize the candidate lists according to, e.g., string
similarity with the entity mention or general popularity of the entity [34].
Many semantic annotation systems exist, such as [9], which enables
semi-automated semantic text annotation in real time. This feature seems
promising but is not applicable to the presented use case. The Pundit Annotator
Pro by [31] allows users to define their own properties and knowledge bases. However,
the annotator is not available for free. Another alternative is the INCEpTION
annotator by [22], which implements a variety of complex linguistic and
semantic annotation functionalities. However, to semantically annotate parts of 
the constitution documents from the mentioned use case and to assess the suffi- 
ciency of the DBpedia knowledge base and annotation techniques for sociology, 
the refer annotation system was used [42]. refer consists of a set of powerful tools 
focusing on NEL. It aims at helping text authors to semi-automatically analyze 
textual content and semantically annotate it with DBpedia entities. In refer, 
automated NEL is complemented by manual semantic annotation supported by 
sophisticated autosuggestion of candidate entities. refer is chosen for this task, 
because it fulfills all annotation criteria mentioned by [21], is publicly available⁹,
and is configurable. Furthermore, a user study focusing on lay-users has shown
that the refer annotation interface is easy to use and enables a sophisticated
annotation process for lay-users [42]. The user can choose between a manual
and an automated annotation process. For automated annotation, refer deploys
KEA-NEL [43]. For manual (or semi-automated) annotations, the refer annotator
includes two configurable interfaces for creating or correcting annotations:
the Modal annotator, shown in Fig. 2, and the Inline annotator. The interface


9 https://www.refer.cx/, last accessed: May 12, 2019.

leaves sufficient space for displaying relevant entities and additional information.
Also, it provides a useful parallel view of all available categories. While this manual
method seems (and is) cumbersome, it enables an in-depth evaluation of the feasibility of
DBpedia for constitution documents.
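
The candidate ranking mentioned in this subsection (string similarity with the mention combined with entity popularity) can be illustrated as follows. This is a sketch under stated assumptions and not the ranking implemented in refer: the weighting scheme, the function name rank_candidates and the popularity scores are invented for illustration, and string similarity is approximated with the ratio from Python's difflib.

```python
from difflib import SequenceMatcher


def rank_candidates(mention, candidates, similarity_weight=0.7):
    """Order (label, popularity) pairs by a weighted mix of two signals.

    popularity is assumed to be normalized to [0, 1]; the weight of 0.7
    is an arbitrary illustrative choice, not a value from the paper.
    """
    def score(candidate):
        label, popularity = candidate
        similarity = SequenceMatcher(None, mention.lower(), label.lower()).ratio()
        return similarity_weight * similarity + (1 - similarity_weight) * popularity

    return sorted(candidates, key=score, reverse=True)


# Made-up candidate labels and popularity values for the mention "König".
CANDIDATES = [("King", 0.5), ("König", 0.4), ("Königsberg", 0.3)]
ranked = rank_candidates("König", CANDIDATES)
best_label = ranked[0][0]  # the exact label match is ranked first here
```

In a real interface the popularity signal would come from the knowledge base (for example link counts), and the list would be truncated before being shown to a lay-user.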


Annotation Criteria. For the presented annotation task, several annotation 
criteria were defined, crucial for reproducibility. For sociological text analysis, 
it is generally assumed that rigid as well as non-rigid designators are important 
[25]. The rationale here is to generate as much knowledge as possible from the 
text to be able to analyze the data from multiple perspectives. Further entity 
annotation criteria concern entity specificity and completeness. Textual mentions
were to be annotated with semantic entities as specifically and as completely
as possible. A ‘Not In List’ (NIL) entity was created and included in the configurable
annotation interface. Whenever the annotating user encountered an
entity not available in the knowledge base, the NIL entity was used, which makes it possible to assess
the level of completeness of the annotations and the sufficiency of the knowledge
base. When annotating historical text documents for scientific analysis, it
is especially important to acknowledge the entities’ temporal role. That means that
if a text in a Dutch constitution document edition from the year 2016 mentions
a term like ‘der König’ (the King), the term was annotated with the DBpedia
resource dbr:Willem-Alexander_of_the_Netherlands. This task of temporal role
detection is part of current research in NLP. Advances in this field have been
accomplished by [24], and the topic is also tackled in a current research project led
by the University of Zurich¹⁰. Even though NLP and NEL technologies are
constantly improving, this rather difficult task of disambiguation has not yet 
been solved in a way that it can be easily implemented in any domain. This 
aspect also affirmed the decision to proceed with a manual annotation process 
in this use case. 
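
To make the temporal role criterion concrete, the following sketch resolves the mention ‘der König’ to the monarch reigning in the year of a given edition. It is an illustration only: the annotation in the presented work was done manually, the function name resolve_king_mention is hypothetical, and the DBpedia identifiers other than those for Beatrix and Willem-Alexander are assumed to follow the corresponding English Wikipedia article titles. Accession years are used at year granularity, so editions published in a year of succession would need exact dates.

```python
from bisect import bisect_right

# (accession year, DBpedia resource); identifiers partly assumed, see lead-in.
ACCESSIONS = [
    (1849, "dbr:William_III_of_the_Netherlands"),
    (1890, "dbr:Wilhelmina_of_the_Netherlands"),
    (1948, "dbr:Juliana_of_the_Netherlands"),
    (1980, "dbr:Beatrix_of_the_Netherlands"),
    (2013, "dbr:Willem-Alexander_of_the_Netherlands"),
]
_YEARS = [year for year, _ in ACCESSIONS]


def resolve_king_mention(edition_year):
    """Return the resource of the monarch reigning in the given edition year."""
    index = bisect_right(_YEARS, edition_year) - 1
    if index < 0:
        raise ValueError(f"no monarch recorded for {edition_year}")
    return ACCESSIONS[index][1]


entity_2016 = resolve_king_mention(2016)
entity_1983 = resolve_king_mention(1983)
entity_1884 = resolve_king_mention(1884)
```

A lookup table of this kind only works for a single, well-documented role; the general disambiguation problem discussed above remains open.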


Result. Parts of three constitutional documents were semantically annotated 
with DBpedia entities according to the criteria and method discussed above. The 
RDFa output created with refer was converted into NIF2 to ensure interoperability
between language resources and annotations [15]. Overall, 1,175 annotations
were created in three constitution documents using 218 distinct DBpedia entities.
This means that, on average, each DBpedia entity was used around five
times. Across all documents, 242 NIL annotations were used, which means that
around 20% of all named entities in the documents were not in the knowledge
base (or could not be found). All annotations and a list of NIL annotation surface
forms are presented on GitHub¹¹.
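
The two ratios quoted above follow directly from the reported counts. The short calculation below reproduces them; it assumes, as the text suggests, that the average is taken over all 1,175 annotations and that the NIL share is computed against the same total.

```python
total_annotations = 1175   # annotations in the three documents
distinct_entities = 218    # distinct DBpedia entities used
nil_annotations = 242      # annotations with the NIL entity

mean_uses_per_entity = total_annotations / distinct_entities  # about 5.39
nil_share_percent = 100 * nil_annotations / total_annotations  # about 20.6

summary = (f"{mean_uses_per_entity:.2f} uses per entity, "
           f"{nil_share_percent:.1f}% NIL annotations")
```

Both values are consistent with the rounded figures in the text (around five uses per entity and around 20% NIL annotations).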


10 http://www.cl.uzh.ch/en/research/completed-research/hist-temporal-entities.html,
last accessed: May 12, 2019.
11 https://github.com/tabeatietz/semsoc, last accessed: May 12, 2019.

Lessons Learned. Overall, it can be concluded that semantic annotations significantly
improve the reproducibility of the research process, especially using
ontologies like NIF2 or Open Annotation [36], because each conclusion drawn 
from the annotation (or coding) process can be proven directly in the annota- 
tion document up to character level. Data re-use is also ensured, especially if the 
annotation criteria are listed in the research process. The created annotations 
may be re-used in the form of RDFa, useful for HTML pages, or NIF2, useful for
querying and further adaptation. In general, if the annotations are created thoroughly,
they can furthermore function as a gold standard for computer scientists
to improve and test NEL systems, especially with regard to the annotation criteria
mentioned above. However, the process also revealed insufficiencies in terms
of the underlying knowledge base, language support and process automation.
In the following, these shortcomings are listed and discussed with the goal of
stressing their importance for future research.


1. Knowledge Graph: Choosing DBpedia to annotate constitution documents 
seems reasonable, because the text corpus deals with constitutions, i.e. coun- 
try specific information and facts about state leaders. These topics are gener- 
ally well represented in Wikipedia. However, for 20% of all annotations NIL- 
entities were used. Therefore it can be concluded that solely using DBpedia 
is not enough for a profound annotation of these documents. One reason for 
this may be the systemic bias in Wikipedia [33]. It is easy to imagine that 
this problem does not only exist for constitution related documents but for 
a broad range of topics and domains. In general, solving this problem in the 
long term is crucial to enable sociologists to reliably use the knowledge base 
for their research process. In future work, Wikidata should also be tested as a
knowledge base for the analysis; however, to the best of our knowledge, no
user interface comparable to refer exists for annotating text with Wikidata items.
Moreover, sociologists also need to partake in the process of creating
knowledge graphs which fulfill their annotation needs. This way, the entire 
Linked Data community can benefit from this interdisciplinary approach as 
well. 

2. Language Issues: Most NEL systems are created for English language text. 
This is a major problem when large non-English text corpora have to be
analyzed. If non-English cultural heritage content is to be analyzed
and understood by sociologists and researchers in other domains, this is an important
research task for the future. One prominent automated NEL system for
German language text is DBpedia Spotlight [7]. However, initial experiments 
with the system revealed that the annotations did not meet the criteria men- 
tioned above. Therefore it was eliminated from the research process. 

3. Historical Text: The fact that the corpus in the use case includes documents 
dating back to 1884 further complicates the annotation process. For instance, 
it was important to map entities according to their temporal role. So far, 
there is no NEL system available which annotates these temporal
roles in German-language text with decent quality. Apart from the temporal
role disambiguation, a further challenge this corpus poses is
the changing style of language in the documents over time.

Even though these insufficiencies emerged during the annotation of constitutional
texts, the problems are generalizable to a broad range of cultural heritage
data. If these open research problems are resolved, social scientists seeking to
understand large text corpora will be able to use semantic annotation systems for
their analysis process in a more automated manner.


4.3 Querying 


When analyzing historical content in sociology, its changes over time as well as 
their causes and effects with regard to the society in which they appeared are 
crucial information to be studied. These changes may appear in the structure of 
a document as well as in the content itself. This section discusses,
using the example of the use case, how the previous data modeling and semantic
annotation support the analysis process. For this purpose, all previously generated
data were imported into the Blazegraph triple store¹² to be queried using
SPARQL.


12 https://www.blazegraph.com/, last accessed: May 13, 2019.


Fig. 3. Timeline of constitution editions, chapter numbers and context information
(Color figure online)


Time Based Analysis. Figure 3 visualizes an example of the analysis process
which is enabled by previously modeling the data as Linked Data and querying it.
The different constitution editions are placed on a timeline along with information
on structural changes and DBpedia context information. Constitutional
documents follow a strict formal hierarchy. Each document is organized into several
units: chapters, paragraphs, articles, and sections. A constitution’s
chapter, as the top-level structural unit, sets the entire framework of the constitution.
Therefore, querying and visualizing the changes of chapter numbers in
constitutions (cf. red line) already reveals significant changes made in each document
and allows the sociologist to focus on specific editions in the further analysis.
Via federated querying, context information can be integrated into the process.
In this case, information on the respective Dutch monarch was integrated via
DBpedia, which may provide hints on the causes or effects of constitutional
changes for further investigation.
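
The chapter-count analysis described above can be sketched as follows. The SPARQL text is a hypothetical query shape, not the query used for Fig. 3: the type name s:Chapter and the use of co:isSectionOf to link a chapter to its edition are assumptions. The Python part mimics the aggregation on a handful of invented result rows; the chapter counts are made up and do not reflect the actual constitutions.

```python
from collections import Counter

# Hypothetical query shape; s:Chapter is an assumed type name.
CHAPTER_COUNT_QUERY = """
PREFIX co: <http://www.constituteproject.org/ontology/>
PREFIX s:  <https://github.com/tabeatietz/semsoc/>
SELECT ?edition (COUNT(?unit) AS ?chapters)
WHERE {
  ?unit co:rowType s:Chapter ;
        co:isSectionOf ?edition .
}
GROUP BY ?edition
"""

# Invented rows of the form (edition year, unit suffix, row type).
ROWS = [
    (1884, "t1", "Chapter"), (1884, "t2", "Chapter"), (1884, "t2_s2", "Paragraph"),
    (1983, "t1", "Chapter"), (1983, "t2", "Chapter"), (1983, "t3", "Chapter"),
    (2016, "t1", "Chapter"), (2016, "t2", "Chapter"), (2016, "t3", "Chapter"),
]


def chapters_per_edition(rows):
    """Count chapter units per edition year, ordered by year."""
    counts = Counter(year for year, _unit, row_type in rows if row_type == "Chapter")
    return dict(sorted(counts.items()))


def chapter_changes(counts):
    """Difference in chapter count between each edition and its predecessor."""
    years = sorted(counts)
    return [(later, counts[later] - counts[earlier])
            for earlier, later in zip(years, years[1:])]


counts = chapters_per_edition(ROWS)
changes = chapter_changes(counts)
editions_to_inspect = [year for year, delta in changes if delta != 0]
```

Editions with a non-zero difference are the ones a sociologist would inspect first, which corresponds to reading the red line in Fig. 3.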


[Figure 4: graph drawing; only its labels survive text extraction.
Prefixes: nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>, itsrdf: <http://www.w3.org/2005/11/its/rdf#>, s: <https://github.com/tabeatietz/semsoc/>, dbr: <http://dbpedia.org/resource/>, dbc: <http://dbpedia.org/category/>, co: <http://www.constituteproject.org/ontology/>, skos: <http://www.w3.org/2004/02/skos/core#>.
Mention nodes s:constitution_24_n_2016-constitution_t2_s2?char=5,10 and s:constitution_18_n_1983-constitution_t2_s2?char=5,10 (nif:anchorOf "König") point via nif:referenceContext to their paragraph nodes (nif:isString "§ 2. König und Minister") and via itsrdf:taIdentRef to dbr:Willem-Alexander,_Prince_of_Orange and dbr:Beatrix_of_the_Netherlands.
Further labels: dbc:Dutch_monarchs, dbr:Ulrika_Eleonora,_Queen_of_Sweden, "male".]

Fig. 4. In two separate constitution versions, “König” (King) was annotated with the
respective DBpedia entity (Beatrix and Willem-Alexander), which allows the graph
structure of DBpedia to be exploited.


Knowledge Graph Structure. Linked Data enabled sociological content analysis
is especially useful when DBpedia entities are not only included in the
analysis to widen the context, but the underlying graph structure is also utilized,
as visualized in Fig. 4. In constitution texts, the monarch is named “King”
at all times, even if the monarch was a woman (Queen). In sociology, the information
about the monarch’s gender is vital [5]. With the temporal role annotations
as described in Sect. 4.2, the respective constitution editions can be aggregated in
a more meaningful manner. When only taking into account the Dutch constitution,
this possibility seems rather unspectacular, but being able to aggregate all
European constitutions according to the gender of the head of state emphasizes
how useful Linked Data can be in this analysis process.
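
The aggregation by the monarch's gender can be illustrated with the two annotations shown in Fig. 4. The sketch below is an assumption-laden stand-in for a SPARQL query over the NIF data and DBpedia: the dictionaries replace the triple store, the GENDER lookup replaces a property that would be fetched from the knowledge graph, and the function names are hypothetical. The mention identifier follows the ?char=begin,end pattern visible in the figure.

```python
from collections import defaultdict

# Paragraph contexts (nif:isString) keyed by their identifier, as in Fig. 4.
CONTEXTS = {
    "s:constitution_24_n_2016-constitution_t2_s2": "§ 2. König und Minister",
    "s:constitution_18_n_1983-constitution_t2_s2": "§ 2. König und Minister",
}

# One annotation per edition: character offsets plus the linked entity.
ANNOTATIONS = [
    {"context": "s:constitution_24_n_2016-constitution_t2_s2",
     "begin": 5, "end": 10, "entity": "dbr:Willem-Alexander_of_the_Netherlands"},
    {"context": "s:constitution_18_n_1983-constitution_t2_s2",
     "begin": 5, "end": 10, "entity": "dbr:Beatrix_of_the_Netherlands"},
]

# Stand-in for gender information obtained from the knowledge graph.
GENDER = {
    "dbr:Willem-Alexander_of_the_Netherlands": "male",
    "dbr:Beatrix_of_the_Netherlands": "female",
}


def anchor_of(annotation):
    """Surface form covered by the annotation (nif:anchorOf)."""
    text = CONTEXTS[annotation["context"]]
    return text[annotation["begin"]:annotation["end"]]


def mention_iri(annotation):
    """Identifier of the mention, following the pattern shown in Fig. 4."""
    return f'{annotation["context"]}?char={annotation["begin"]},{annotation["end"]}'


def contexts_by_gender(annotations):
    """Group annotated contexts by the gender of the linked entity."""
    groups = defaultdict(list)
    for annotation in annotations:
        groups[GENDER[annotation["entity"]]].append(annotation["context"])
    return dict(groups)


surface_forms = [anchor_of(a) for a in ANNOTATIONS]
first_mention = mention_iri(ANNOTATIONS[0])
groups = contexts_by_gender(ANNOTATIONS)
```

Both mentions have the identical surface form, yet they fall into different groups, which is exactly the distinction that the text alone cannot provide.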


Visual Aids. The RDFa enrichment created with refer makes it possible to visualize
additional information about annotated entities directly within the context of
the document, which has proven to be useful for the research process. When
the annotated text is published within WordPress (as is the case with refer),
the annotations are immediately present in the document’s HTML code. On
mouseover, a so-called infobox as shown in Fig. 5 is displayed below the annotated
text fragment. It contains basic information about the entity derived from
DBpedia, e.g. a thumbnail and additional data from the entity’s RDF graph put
in a table layout. When exploring an annotated document corpus of interest,
sociologists can make use of these infobox visualizations to learn more about the
data in front of them without having to leave the original context of the text.
This can support a better understanding of the text, for instance if a certain
term is unknown to them or, as shown in Fig. 5, they want to learn about the
temporal roles of entities.


Fig. 5. Infobox visualization of former Prime Minister Ruud Lubbers


Discussion. Querying the documents for sociological content analysis with
SPARQL revealed that the created data model and semantic annotations are
immensely useful: they allow the data to be aggregated not only within the corpus
on its own but also by exploiting DBpedia’s graph structure. Using
SPARQL on an RDF dataset which is shared with the research community also
makes it possible to share each query which led to the respective results. To make these
benefits available to a large number of sociologists, a task for interdisciplinary
future work is to create effective interactive visualizations for content analysis.
These visualizations can be timelines which also incorporate context information
from an external knowledge graph as well as relationship visualizations.

This section demonstrated the benefits of applying Linked Data standards to
the different tasks of content analysis in sociology. This involves data modeling 
and publishing, annotation and querying. Major open research problems include 
the extension and improvement of existing knowledge graphs, the improvement 
of NEL systems for non-English texts and the possibility to annotate entities with 
respect to their temporal roles. Furthermore, meaningful visualizations may be 
developed to enable a better scientific exploration for non-technical users. 


5 Conclusion 


Content analysis in sociology is a gateway to understanding cultural heritage 
data. While a number of methods evolved to contribute to this process of mod- 
eling, annotating and analyzing textual content, most methods lack sufficient 
standardization which results in a research process where the results are often 
not reproducible and the data cannot be reused. Linked Data may be one way to 
counter these problems. The goal of this paper was therefore to present and dis- 
cuss intersection points between Linked Data and content analysis in sociology. 
On the use case of historical Dutch constitutional documents, it was shown how 
Linked Data can enhance the entire research process by modeling and distribut- 
ing research data in RDF, by semantically annotating texts e.g. with DBpedia 
entities and by querying the documents using SPARQL. One contribution of 
this paper is to provide lessons learned from the process, which revealed impor- 
tant and interesting open problems to be solved in interdisciplinary research 
between Linked Data experts and sociologists. Finally, it became apparent that 
in order to better understand cultural heritage data and its meaning for soci- 
ety, the Linked Data research community is challenged to support sociologists 
in improving their research to be more transparent, reproducible and re-usable.


Abstract. This paper describes the LinkedSaeima dataset that contains structured
data about Latvia’s parliamentary debates from 1993 until 2017. This
information is published at http://dati.saeima.korpuss.lv as Linked Open Data. It 
is a part of the Corpus of Saeima (the Parliament of Latvia) released as open data 
for multidisciplinary research. The data model of LinkedSaeima follows the data 
structure of the LinkedEP dataset with a few modifications. The dataset is 
augmented with links to the Wikidata knowledge base that provide additional 
information about the speakers and named entities mentioned in the corpus. 


Keywords: Linked Open Data · Parliament debate corpus · Named entity
linking · Open government data · RDF


1 Introduction 


To ensure transparency of political and legislative processes, parliament proceedings
and debate transcripts are usually made public. Saeima, the Parliament of the Republic
of Latvia, publishes plenary transcripts on its website as unstructured text¹. In 2016
we published these transcripts as a text corpus with speaker annotations and other
metadata [1]. With the increasing availability of corpora in different languages, we
realized that unannotated corpora are not enough to address researchers’ needs such as
comparative research across multiple languages. The 2018 release of the Corpus of
Saeima attempted to address this concern by adding multiple annotation layers,
including named entity mentions, automated English translation and morphosyntactic
information for linguistic analysis [2]. This release is available in multiple
commonly used formats: as a text corpus in NoSketch query software², as syntactically
parsed data and as Linked Open Data [3].


1 http://www.saeima.lv/lv/transcripts/category/21.

This paper describes LinkedSaeima³, a Linked Data representation of the Corpus
of Saeima containing structured information about Saeima proceedings and the entities 
mentioned in the proceedings, represented using Wikidata identifiers [4]. Linked Data 
allows us to represent structured information about parliamentary debates by describing 
the properties of the objects from the domain of parliamentary meetings and relations 
between these objects. 


2 Parliamentary Speech Corpus 


The source of data for this corpus is the Saeima website that contains transcripts of all 
parliament sessions in text format. These transcripts are processed using a semi- 
automatic pipeline to identify the boundaries of speeches and the speakers. 

The Corpus of Saeima contains information about debates from seven parliamentary
terms (5th–12th Saeima) covering the years 1993–2017. The transcriptions in this
corpus contain 38 million tokens and 497 thousand utterances. The available metadata
for each utterance includes the date and type of the parliamentary session and speakers’
names and affiliations. A subset of speeches, starting from 2015, was translated from
Latvian to English using a neural machine translation system [5]. The unreviewed
machine-generated translation is included in the corpus for quantitative analysis
purposes and to aid searchability and understanding for international researchers.
However, the text quality of the automated translation is not sufficient for
qualitative analysis of the Saeima corpus.

The named entities mentioned in this corpus were automatically linked to Wikidata
as the entity knowledge base [4]. The named entity recognition system is based on a
full text search of Wikidata entity names. These aliases are extended by generating a
heuristic list of alternative variants for organization and people names, which are
then inflected through a custom Latvian phrase inflection system built upon the Latvian
morphosyntactic tagger [6]. As the goal of named entity recognition was primarily to
provide a mapping to Wikidata, no technical means were applied to recognize entities
without relevant Wikidata entries. However, in order to improve the coverage of entity
linking, Wikidata entries for historical members of parliament and other officials were
created (if not already existent) and populated with data based on open access sources
available from Saeima. To disambiguate entities with overlapping names, the most likely
entity was chosen based on a cosine similarity metric with respect to structured
Wikidata information extracts, adapting a system developed earlier for news corpora
analysis [7].
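The disambiguation step can be illustrated with a small sketch. It is only a hedged approximation of the idea, not the system from [7]: the bag-of-words representation, the whitespace tokenizer, the candidate identifiers (Q_A, Q_B) and their descriptions are all assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a deliberately naive tokenizer (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context, candidates):
    """Return the candidate id whose description is most similar to the context."""
    ctx = bag_of_words(context)
    return max(candidates, key=lambda cid: cosine(ctx, bag_of_words(candidates[cid])))


# Hypothetical candidates sharing one name; ids and descriptions are invented.
candidates = {
    "Q_A": "latvian politician minister of economy member of parliament",
    "Q_B": "latvian chess player woman grandmaster",
}
best = disambiguate("the minister of economy addressed the parliament", candidates)
```

In this toy setting the first candidate wins because only its description shares tokens with the speech context; a real system would compare against richer structured extracts.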


2 NoSketch interface for this corpus: http://dati.saeima.korpuss.lv/nosketch/.
3 http://dati.saeima.korpuss.lv/. 
3 LinkedSaeima Dataset


This paper focuses on LinkedSaeima, the Linked Data representation of the Saeima
speech corpus. The current version of the dataset, published in May 2019, consists of
approx. 4.9 million RDF triples⁴. Since the original January 2018 release we have fixed
the identified issues with its RDF representation and improved the usability of the
human-readable view of the dataset.

The dataset contains 497221 speeches (utterances) from 1293 parliament meetings. 
These speeches were given by 690 speakers with 162 speaker roles and contain 392530 
mentions of 2998 unique Wikidata entities. It includes information about the following 
classes of objects: 


• Meeting (lpv_eu:SessionDay): a top-level concept representing one parliament
  plenary meeting, usually consisting of multiple Speeches;
• Speech (lpv_eu:Speech): an individual speech (utterance) given at a Meeting by a
  single Speaker in a particular Role;
• Speaker (lpv:Speaker): a person giving a speech;
• Role (lpv:PoliticalFunction): a role which the person represented when giving a
  Speech (e.g. the Prime Minister). A person may appear in multiple roles.


[Figure 1: LodView page for the resource
http://dati.saeima.korpuss.lv/entity/speech/2015_02_05_284-seq43, an entity of type Speech.]

lpv:number          43 (xsd:integer)
dc:date             2015-02-05 (xsd:date)
lpv:spokenText      Mans mērķis, kad es ierosinaju atteikties no dalības projektā, bija nedot
                    iespēju izšķērdēt valsts līdzekļus. Un apstāties pie tā, kas jau šobrīd ir
                    izdarīts [..] @lv
dc:language         lv (xsd:language)
lpv:translatedText  My goal, when I proposed to opt out of the draft membership, had withheld
                    the possibility of wasting public resources and Un stop at what is already
                    done [...] @en
rdf:type            lpv_eu:Speech
dcterms:isPartOf    <http://dati.saeima.korpuss.lv/entity/meeting/2015_02_05_284>
lpv:spokenAs        <http://dati.saeima.korpuss.lv/entity/role/103>
                    (Government of Latvia | Minister of Economy)
lpv:speaker         <http://dati.saeima.korpuss.lv/entity/speaker/Dana_Reizniece-Ozola-1981>
                    (Dana Reizniece-Ozola)


Fig. 1. LinkedSaeima information about a speech (in LodView browser).
(http://dati.saeima.korpuss.lv/entity/speech/2015_02_05_284-seq43)


4 LinkedSaeima RDF dump is available at http://saeima.korpuss.lv/datasets/rdf/.
Figure 1 shows an example of a Speech. Its properties include date (dc:date),
sequence number (lpv:number), spoken (lpv:spokenText) and translated
(lpv:translatedText) text, and it is related to the SessionDay it is a part of
(dct:isPartOf), to the Speaker (lpv:speaker), to the Speaker’s Role (lpv:spokenAs)
and to the named entities recognized in the text.


[Figure 2: diagram of the data model. Labels recoverable from the figure: the class
PoliticalFunction and the node "Wikidata entity"; the properties dc:title, owl:sameAs,
dct:isPartOf, dct:hasPart, lpv:spokenAs, dc:date, dc:language, lpv:number,
lpv:spokenText, lpv:translatedText, lpv:speaker, schema:mentions, foaf:name,
foaf:gender, dbp:birthYear and dbp:deathYear.]


Fig. 2. The data model of the LinkedSaeima dataset.


The data model of the LinkedSaeima dataset, shown in Fig. 2, follows the model of
the LinkedEP project and the Linkedpolitics vocabulary used in it, referenced in this
paper using the vocabulary prefixes lpv and lpv_eu [8]. The main innovation of this
dataset, compared to LinkedEP, is the addition of named entity information, represented
using the schema:mentions property pointing to Wikidata entity identifiers. Another
difference is that we “materialize” speaker Roles extracted from the corpus by giving them
URI identifiers that can be used for querying the dataset (e.g. for speeches by Ministers
of Foreign Affairs) and linking them to other datasets. Speaker roles
(lpv:PoliticalFunction) may also contain links to matching entities in Wikidata.

There is ongoing work for standardization of corpora of parliamentary proceedings 
based on TEI [9]. Our approach could be applied to other parliamentary speech corpora 
by implementing a transformation from the TEI standard once it is finalized in order to 
make these resources available as Linked Data. 


4 Data Access and Implementation 


The LinkedSaeima dataset can be accessed: 


• as Linked Data (published using LodView);
• using a Triple Pattern Fragments server and user interface⁵;
• as a single RDF file⁶.


5 http://dati.saeima.korpuss.lv/ldf/saeima.
6 http://saeima.korpuss.lv/datasets/rdf/.
The dataset is published as Linked Data: all its objects have HTTP URIs, and
information about them can be retrieved by looking up their URIs. The Linked Data
interface is implemented using the LodView linked data browser⁷, which can serve data in
RDF, HTML and multiple other formats. The URI patterns used in the dataset,
illustrated by examples, are listed in Table 1.


To provide a lightweight query interface, the dataset is also published through a
Triple Pattern Fragments (TPF) server [10]. In addition, the dataset is released as a
single RDF file that researchers can use to run more complex queries and analyses.
For example, Listing 1 demonstrates how researchers can use SPARQL to perform
statistical queries on this dataset.
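As a hedged illustration of the lightweight interface, the sketch below builds the URL of a single triple pattern request. The base URL is the one given in the footnote on the TPF server; the query parameter names (subject, predicate, object) are an assumption based on common Linked Data Fragments server setups, since a conforming client would normally discover them from the hypermedia controls in the server response. No network request is made.

```python
from urllib.parse import urlencode

# Base URL taken from the paper's footnote on the TPF server.
FRAGMENT_BASE = "http://dati.saeima.korpuss.lv/ldf/saeima"


def fragment_url(subject=None, predicate=None, obj=None):
    """Build a request URL for one triple pattern; None means 'any value'.

    The parameter names are assumed, not confirmed by the paper.
    """
    pairs = (("subject", subject), ("predicate", predicate), ("object", obj))
    params = {name: value for name, value in pairs if value is not None}
    return FRAGMENT_BASE + ("?" + urlencode(params) if params else "")


# Pattern: all speeches spoken in role 23 (the role used in Listing 1).
url = fragment_url(
    predicate="http://purl.org/linkedpolitics/vocabulary/spokenAs",
    obj="http://dati.saeima.korpuss.lv/entity/role/23",
)
```

A single triple pattern is all the server evaluates; joins and aggregates such as those in Listing 1 are left to the client, which is what keeps the server side lightweight.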


Table 1. URI patterns used in the LinkedSaeima dataset.


Type         URI pattern
Speech       /entity/speech/2015_02_05_284-seq43
Speaker      /entity/speaker/Dana_Reizniece-Ozola-1981
Role         /entity/role/103
SessionDay   /entity/meeting/2015_02_05_284
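The example URIs suggest that a speech identifier encodes the session date, a meeting suffix and the sequence number of the speech. The following sketch parses a speech URI under that reading; the regular expression and the field names are assumptions inferred from the examples in Table 1 and Fig. 1, not a specification given by the authors.

```python
import re
from datetime import date

# Assumed shape of the local name: YYYY_MM_DD_<suffix>-seq<number>.
SPEECH_ID = re.compile(r"^(\d{4})_(\d{2})_(\d{2})_([^-]+)-seq(\d+)$")


def parse_speech_uri(uri):
    """Split a speech URI into its date, meeting path and sequence number."""
    local_name = uri.rstrip("/").rsplit("/", 1)[-1]
    match = SPEECH_ID.match(local_name)
    if match is None:
        raise ValueError(f"not a recognised speech identifier: {local_name!r}")
    year, month, day, suffix, seq = match.groups()
    return {
        "date": date(int(year), int(month), int(day)),
        "meeting": f"/entity/meeting/{year}_{month}_{day}_{suffix}",
        "number": int(seq),
    }


parsed = parse_speech_uri(
    "http://dati.saeima.korpuss.lv/entity/speech/2015_02_05_284-seq43"
)
```

For the speech shown in Fig. 1, the derived values agree with its dc:date, lpv:number and meeting link, which is why this reading of the pattern seems plausible.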


Listing 1. A query for the yearly statistics of speeches by the Minister of Foreign Affairs


PREFIX lpv: <http://purl.org/linkedpolitics/vocabulary/>
PREFIX lpv_eu: <http://purl.org/linkedpolitics/vocabulary/eu/plenary/>
PREFIX saeima_role: <http://dati.saeima.korpuss.lv/entity/role/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# Count the speeches given in one role, grouped by calendar year.
SELECT ?year (COUNT(?speech) AS ?count)
WHERE {
  ?speech a lpv_eu:Speech .
  ?speech lpv:spokenAs saeima_role:23 .
  ?speech dc:date ?date .
  BIND (year(?date) AS ?year)
}
GROUP BY ?year
ORDER BY ?year

7 LodLive linked data browser: https://github.com/dvcama/LodLive/. 
5 Conclusions


In this paper we described LinkedSaeima, a Linked Data representation of the dataset
of Latvia’s parliamentary debates extended with NLP annotation layers. We hope that
its Linked Data representation and the new annotation levels (entity references and
translation) will allow researchers from other countries to use this resource in their
studies, for example to compare Latvia’s parliamentary data with data from other
national parliaments, and will provide users with new ways of exploring this information.

Expected future work includes extending the LinkedSaeima dataset with additional
types of structured information, for example, voting data, and adding automated
translations for the whole historical dataset. Improvements to entity recognition and
morphosyntactic tagging are being carried out as part of related research projects.

By publishing this parliamentary corpus as Linked Open Data and by including 
links to Wikidata entities we hope to facilitate the development of a global network 
of linked political and legal information, and to provide an example to other 
implementers. 
Abstract. The World Wide Web is one of the main ways of accessing
knowledge in cultural heritage. Recently, several projects in digital humanities 
have emerged; however only a few are specialized in musicology. In this paper, 
we present MusicKG, a multilingual knowledge graph about medieval musi- 
cology and musical iconography. A specific ontology has been designed to 
integrate data from several iconographic and musicology databases. In addition, 
MusicKG is connected to the Linked Open Data cloud with a significant part of 
its classes, properties and instances being linked to Wikidata, Getty Vocabu- 
laries, MIMO, Iconclass and GeoNames. MusicKG is accessible and reusable by 
three means: a downloadable RDF dump, a Virtuoso faceted browser and a 
public SPARQL endpoint. Some representative SPARQL query examples are 
given to illustrate the scope of MusicKG and to show the potential impact on the 
research work in medieval musicology. 


Keywords: Knowledge graph · Linked Open Data · Ontology · Musicology ·
Cultural heritage · Musical iconography


1 Introduction 


The conservation of cultural heritage is very important for humankind. Today, many
cultural organizations and actors store and maintain cultural data in digital form.
Applications such as virtual museum visits and cultural search portals have been
developed to provide digital experiences and interactions with cultural data.

Semantic Web technologies have been used in the cultural heritage field for more
than a decade. A large number of semantic data models, vocabularies and
knowledge graphs (KG) have flourished. On the data model and vocabulary side, we
can mention the CIDOC Conceptual Reference Model [3], Cultural-ON [8], Sampo [6] and
the Getty vocabularies¹. With its own data model [7], Europeana tries to facilitate the
discoverability of cultural resources by collecting the resources’ metadata and
centralizing them [5].


1 http://www.getty.edu/research/tools/vocabularies/lod/.

On the knowledge graph side, the Amsterdam Museum’s Linked Open Data
comprises the entire collection of the Amsterdam Museum, consisting of more than
70,000 object descriptions [2]. ArCo² is a knowledge graph containing around 800,000
catalogue records of Italian cultural heritage entities (e.g. archaeological objects,
numismatic objects). In the music domain, which is closest to our work, we
can mention the LinkedBrainz³ project, which helps MusicBrainz (an open music
encyclopedia that collects music metadata) publish its database as Linked Data. Last, the
DOREMUS knowledge graph [1] describes classical music works and their associated
events (e.g. performances in concerts). The data come from three major French cultural
institutions: the French National Library, Radio France and the Philharmonie de Paris.

In this paper, we introduce MusicKG, a unique cultural heritage knowledge graph 
containing representations of sound and music in the Middle Ages. In Sect. 2, we 
describe the source data of MusicKG. In Sect. 3, we detail the ontology underlying 
MusicKG. In Sect. 4, we show how MusicKG is connected to the Linked Open Data 
cloud. In Sect. 5, we illustrate the data scope of MusicKG with several SPARQL query 
examples. Section 6 discusses the potential impact of MusicKG on the research in 
medieval musicology and concludes the paper. 


[Figure 1: Musiconis record "David tuning his harp".]

English title        : David tuning his harp
Location (current)   : Charleville-Mézières, Grand Est, France
Location (original)  : Belval, Grand Est, France
Dates                : 1101-1200
Technique            : Illumination
Material             : Paper / Parchment
Support              : manuscrit 246 D, tome 1, folio 1
Partner database     : Initiale
Original title       : David jouant de la harpe
Iconclass reference  : 11I62(DAVID)33 David as psalmist, as author of the psalms
                       48C7322 harp


Fig. 1. King David tuning his harp - http://musiconis.huma-num.fr/en/fiche/39/x.html


2 http://wit.istc.cnr.it/arco/.
3 http://linkedbrainz.org/.
2 Source Data


The data in MusicKG comes from Musiconis, a database of musical iconographies 
created from several partner databases⁴: Musicastallis, Vitrail, Metropolitan Museum
(NY), Mandragore, Initiale, Sculpture, Gothic Ivories, Et Stalla, and Romane. Each of 
the partner databases has its own specificity, generally related to the material support of 
its representations. For example, the Musicastallis database catalogs musical icono- 
graphic representations presented on the carved choir stalls of religious buildings. 
Currently, the Musiconis database contains 2154 iconographic representations whose 
scenes not only contain musical but also vocal, acrobatic or choreographic perfor- 
mances. These scenes are deeply analyzed and each instrument is described with 
organological details. Figure 1 depicts a Musiconis illumination representing King 
David tuning his harp. In this illumination of the character “B”, it is possible to observe 
many details: the number of strings, the tuning key, the characteristics and the detailed 
form of the instrument. 


3 MusicKG Data Model 


We follow the W3C recommendation “Data on the Web Best Practices” [4]. We
reuse vocabularies and resources as much as possible, including Wikidata, from which
the P... and Q... items listed below are taken. In this section, we present the
MusicKG data model, which depicts representations of sound and music in the Middle
Ages. The main class of our model is Visual artwork (Q4502142) (herein “artwork”),
which represents a visual artistic work or creation (see Fig. 2). Each Visual artwork
instance (see example in Fig. 3) is connected to the original sources through several
predicates: official website (P856), collection (P195), inventory number (P217) and
described at URL (P973). Also, each artwork instance has a title from the Musiconis
database and a title from its original database, described by title (P1476) and stated as
(P1932) respectively.


[Figure 2: diagram of the Visual artwork entity and its relations. Legible relation
labels: fabrication method, depicts, depicted by, image, notable work, described at URL,
material used, title, start time, end time, time period, official website, depicts
iconclass and stated as; legible node labels: Art material, Ensemble of work of art,
anyURI and string.]


Fig. 2. The Visual artwork entity with its corresponding relations in our KG


4 http://www.plm.paris-sorbonne.fr/musicastallis/,
http://e-chastel.huma-num.fr/xmlui/handle/123456789/3, https://www.metmuseum.org,
http://mandragore.bnf.fr/html/accueil.html, http://initiale.irht.cnrs.fr,
http://www.gothicivories.courtauld.ac.uk,
https://www.ru.nl/ckd/databases/stalla/introductie/.

Images are essential for iconographic data. Generally, several images (P18) are
associated with an artwork to capture all the details from different angles and with
different resolutions. Regarding dates, each artwork has three different properties: start
time (P580), end time (P582) and time period (P2348), which indicate, respectively, the
date on which the artist began creating the artwork, the date on which the artist
finished it, and the century.

The class Visual artist (Q3391743) refers to the artist who made the artwork. An
artwork is associated with its creator through the relation creator (P170). Each artist
entity is described with the properties birth name (P1477) and notable work (P800). In
addition, we added two relations to each artwork instance: material used (P186) and
fabrication method (P2079). On the one hand, the relation material used describes the
material an artwork is made of. This relation associates instances of artwork with Art
materials (Q15303351) such as Wood (Q287) or Ivory (Q82001) for sculptures;
Textile (Q28823) for embroideries and tapestry weavings; or Glass (Q11469) for
stained glass. On the other hand, the relation fabrication method relates an artwork
to its Artistic technique (Q11177771), such as Sculpture technique (Q21711025)
or Painting technique (Q1231896). In many cases, we have information about the
manuscript (Q87167) or the Architectural element (Q391414) to which a certain
artwork belongs. Examples of architectural elements are archivolts (Q636008) and
misericords (Q1938805), among many others. In those cases, we relate both entities
through the relation part of (P361).


[Figure 3: RDF graph of the artwork instance mkg:artwork39. Prefixes shown in the figure:
mimo: <http://www.mimo-db.eu/InstrumentsKeywords/>, wdt: <http://www.wikidata.org/prop/direct/>,
wd: <http://www.wikidata.org/entity/>, aat: <http://vocab.getty.edu/aat/>,
gns: <http://sws.geonames.org/>, ic: <http://iconclass.org/rkd/>.
Legible node labels include the titles "David tuning his harp"@en, "Roi David accordant une
harpe"@fr and "El rey David afinando un arpa"@es, the original title "David jouant de la
harpe"@fr, inventory number 331, the time period wd:Q7061 (12th century), the collection
Initiale, the manuscript mkg:manuscript246D ("manuscrit 246 D, tome 1, folio 1"), the technique
illumination (aat:300264522) and the material parchment (aat:300011851).]


Fig. 3. Representation of the artwork instance describing the example of Fig. 1.
4 Linking MusicKG to the Linked Open Data Cloud


The singularity of MusicKG is its analysis of performances and the relationships 
between performances. This is one of the main contributions of our Knowledge Graph 
since, as far as we know, there are no other works that describe to this level of detail the 
relationships between entities within iconographic representations. Moreover, our 
model has been enriched with additional information coming from other popular 
Knowledge Graphs: Wikidata, Getty Vocabularies, Iconclass, MIMO and GeoNames.
Figure 4 shows an example of the interconnections between MusicKG and the afore- 
mentioned Knowledge Graphs. 


[Figure 4: graph linking MusicKG resources to external knowledge graphs, using the same
prefixes as Fig. 3 (mimo, wdt, wd, aat, gns, ic). Legible identifiers include wd:Q4502142,
wd:Q160742, wd:Q2326815, wd:Q35140, wd:Q3095, wd:Q639669, gns:3033556, gns:3026613 and
ic:11I62(DAVID)33 ("David en psalmiste, auteur des psaumes, généralement jouant de la
harpe"); legible labels include abbey, municipal library, GeoNames ID (wdt:P1566),
performance, depicts, instrument, occupation (wdt:P106), musician and Hornbostel-Sachs
classification.]


Fig. 4. Example of links between MusicKG and external KGs on the LOD cloud


We used the Wikidata entity Q35140 to represent performances. Each performance
is related to one or several performers (Q16010345) through the relations performer
(P175) and practiced by (P3095). In our KG there are instrumental, vocal,
choreographic and acrobatic performances. For instrumental performances,
the relation instrument (P1303) is used to associate a performance with the
instrument (Q3095) played.
The property occupation (P106) establishes a relationship between a performer
and a profession (Q35140) that represents their occupation or the activity they perform
in the artwork. Some of the sixteenth century professions represented in our
Knowledge Graph are: acrobat, singer, dancer, animal trainer, conjurer, juggler,
pedagogue or partition holder. In addition, a performer may be an instance of (P31) an
animal (Q729), adult (Q9584157) or mythical entity (Q24334685), or have a sex or gender
(P21) such as male (Q6581097) or female (Q6581072).
5 SPARQL Query Examples


MusicKG can be accessed by three means: a downloadable data dump, a Virtuoso
faceted browser and a public SPARQL endpoint. All the information and links are
available online⁵. MusicKG is oriented towards visual artworks, performances,
performers and instruments. Users can specify techniques, materials, historical periods,
etc. To illustrate the data that can be retrieved from MusicKG, Table 1 provides two
representative examples with their associated SPARQL queries and results.


Table 1. Representative competency questions, SPARQL queries and results


Prefixes assumed by both queries:
  PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

Example 1: Artworks using the marquetry technique

  SPARQL query:
    SELECT ?visualArtwork
    WHERE {
      ?visualArtwork rdf:type wd:Q4502142 .
      ?visualArtwork wdt:P2079 ?technique .
      ?technique skos:exactMatch wd:Q1049923 .
    }

  Result:
    Musiconis100: “Two musicians playing the lute and the transverse flute”
    Musiconis241: “Two bagpipe players & two dancing dogs”

Example 2: Artworks depicting a rabbit playing the trumpet

  SPARQL query:
    SELECT ?visualArtwork
    WHERE {
      ?visualArtwork rdf:type wd:Q4502142 .
      ?performance wdt:P1299 ?visualArtwork .
      ?performance wdt:P1303 ?trumpet .
      ?trumpet skos:exactMatch wd:Q8338 .
      ?performance wdt:P175 ?performer .
      ?performer wdt:P31 ?rabbit .
      ?rabbit skos:exactMatch wd:Q9394 .
    }

  Result:
    Musiconis299: “Rabbit playing the trumpet astride a naked man”
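To make the join behind the second query explicit, the sketch below evaluates the same seven triple patterns over a tiny in-memory graph. It is a naive illustration of basic graph pattern matching, not how a SPARQL endpoint is implemented, and every mkg: identifier in the sample data is invented; only the Wikidata property and item identifiers are taken from the table.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def solve(patterns, triples):
    """Return all variable bindings that satisfy every pattern (nested-loop join)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


# Invented sample graph: one matching artwork and one that plays a lute instead.
triples = [
    ("mkg:artwork299", "rdf:type", "wd:Q4502142"),
    ("mkg:perf1", "wdt:P1299", "mkg:artwork299"),
    ("mkg:perf1", "wdt:P1303", "mkg:trumpet"),
    ("mkg:trumpet", "skos:exactMatch", "wd:Q8338"),
    ("mkg:perf1", "wdt:P175", "mkg:performer1"),
    ("mkg:performer1", "wdt:P31", "mkg:rabbit"),
    ("mkg:rabbit", "skos:exactMatch", "wd:Q9394"),
    ("mkg:artwork100", "rdf:type", "wd:Q4502142"),
    ("mkg:perf2", "wdt:P1299", "mkg:artwork100"),
    ("mkg:perf2", "wdt:P1303", "mkg:lute"),
]
patterns = [
    ("?visualArtwork", "rdf:type", "wd:Q4502142"),
    ("?performance", "wdt:P1299", "?visualArtwork"),
    ("?performance", "wdt:P1303", "?trumpet"),
    ("?trumpet", "skos:exactMatch", "wd:Q8338"),
    ("?performance", "wdt:P175", "?performer"),
    ("?performer", "wdt:P31", "?rabbit"),
    ("?rabbit", "skos:exactMatch", "wd:Q9394"),
]
artworks = sorted({b["?visualArtwork"] for b in solve(patterns, triples)})
```

Each pattern narrows the set of candidate bindings, so the second artwork is discarded as soon as its instrument fails the trumpet match.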


6 Conclusion and Future Work 


In this paper, we presented MusicKG, a multilingual cultural heritage knowledge graph
containing representations of sound and music in the Middle Ages. We presented, in
turn, the source data, the ontology data model, the connections to external
sources in the Linked Open Data cloud and representative SPARQL queries. MusicKG
may have a great impact on research in medieval musicology and, more particularly,
in musical iconography. The SPARQL endpoint allows researchers to formulate more
precise queries and to retrieve more accurate results. Furthermore, the connection with
the LOD cloud may bring several benefits that we envisage exploiting. Wikidata items
have multilingual labels and aliases. We plan to retrieve these data more
exhaustively to enable the multilingual display of the knowledge graph. We are
considering making MusicKG searchable in multiple languages so that more people can
easily access this unique cultural heritage database. As Wikidata is becoming the
central hub for cultural heritage datasets, with many institutions publishing their
catalogue data, we will study the ingestion of MusicKG into Wikidata.


5 https://github.com/victoriaeyharabide/MusicKG.


MusicKG: Representations of Sound and Music 63 


References 


1. Achichi, M., Lisena, P., Todorov, K., Troncy, R., Delahousse, J.: DOREMUS: a graph of
linked musical works. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137,
pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_1

2. de Boer, V., et al.: Amsterdam museum linked open data. Semant. Web 4(3), 237-243 (2013) 

3. Doerr, M.: The CIDOC conceptual reference module: an ontological approach to semantic 
interoperability of metadata. AI Mag. 24(3), 75 (2003) 

4. Farias Lóscio, B., Burle, C., Calegari, N. (eds.): Data on the Web Best Practices. W3C
Recommendation, 31 January 2017. https://www.w3.org/TR/dwbp/

5. Freire, N., Meijers, E., Voorburg, R., Isaac, A.: Aggregation of cultural heritage datasets 
through the Web of Data. Procedia Comput. Sci. 137, 120-126 (2018) 

6. Hyvönen, E.: Cultural heritage linked data on the semantic web: three case studies using the 
sampo model. Artium, Vitoria-Gasteiz, Spain, 19-20 October 2016 (2016) 

7. Isaac, A., Haslhofer, B.: Europeana linked open data—data.europeana.eu. Semant. Web 4(3), 
291-297 (2013) 

8. Lodi, G., et al.: Semantic web for cultural heritage valorisation. In: Vrandečić, D., et al. (eds.)
Data Analytics in Digital Humanities, pp. 3-37. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-54499-1_1


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appro- 
priate credit to the original author(s) and the source, provide a link to the Creative Commons 
license and indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly 
from the copyright holder. 


Machine Learning and Deep Learning 
Techniques 




Improving NLU Training over Linked 
Data with Placeholder Concepts 


Tobias Schmitt¹, Cedric Kulbach², and York Sure-Vetter¹,²(✉)


1 Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
tobias.s.schmitt@web.de, york.sure-vetter@kit.edu
2 FZI Research Center for Information Technology, Karlsruhe, Germany
kulbach@fzi.de


Abstract. Conversational systems, also known as dialogue systems,
have become increasingly popular. They can perform a variety of tasks,
e.g. in B2C areas such as sales and customer services. A significant
amount of research has already been conducted on improving the underlying
algorithms of the natural language understanding (NLU) component
of dialogue systems. This paper presents an approach to generate
training datasets for the NLU component from Linked Data resources.
We analyze how differently designed training datasets can impact the
performance of the NLU component. The training datasets differ mainly
in the values that are injected into fixed sentence patterns.
As a core contribution, we introduce and evaluate the performance of
different placeholder concepts. Our results show that a model trained with
placeholder concepts is capable of handling dynamic Linked Data without
retraining the NLU component. Thus, our approach also contributes
to the robustness of the NLU component.


Keywords: Natural Language Understanding · Named Entity
Recognition · Chatbots · Linked Data


1 Introduction 


Modern conversational systems, also called dialogue systems (DS), are entering
people’s day-to-day lives and are offering an increasing number of
services, known to the public mainly in the form of chatbots. The
standard DS consists of three components: the Natural Language Understanding
(NLU) component, which identifies the meaning behind the incoming message
and extracts relevant parts called entities; the Dialogue Manager (DM), which
determines the corresponding action based on the output from the NLU; and
the Natural Language Generator (NLG), which generates the response that is
transmitted to the user [11].

In DS the NLU component mostly uses standard concepts from Natural Lan- 
guage Processing (NLP) tasks. It mainly consists of an intent classifier and 
a named entity recognition (NER) component. Both components make use of 
© The Author(s) 2019 
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machine learning technologies, which mostly need to be trained supervised (s. 
Sect. 1.1). More and more data is published as Linked Data, which forms a 
suitable knowledge base for NLP tasks. In the context of chatbots a key chal- 
lenge is developing intuitive ways to access this data to train an NLU pipeline 
and to generate answers for NLG purposes. Using the same knowledge base for 
NLU and NLG provides a self-sufficient system. An NLU component identifies 
the intents and entities which the NLG component requires for generating the 
response. However, the challenge becomes apparent when the knowledge base 
changes and the already trained NLU model deteriorates in the detection of 
intents and entities. Training on more general training data could avoid computationally
expensive retraining and make the NLU component more robust against
changes in the knowledge base and unclear requests. In this context, we define
the robustness of an NLU by its performance metrics on previously unseen entity
values. As a more general approach to creating appropriate training data for the
NLU, we propose the placeholder concept, where placeholder values are used as
entity values instead of real ones taken from a related knowledge base. These val- 
ues are then filled into predefined sentence patterns to generate the final dataset 
for training the NLU components. As a key result, we show which type of entity
values (placeholder or database values) works best for training an NER algorithm
or an intent classifier.

In a first step, we present the typical process that can be used when designing 
an NLU in the chatbot context. After a motivating example in Sect. 1.2, the
procedure for the construction of training data for an NLU pipeline (Sect. 2) is 
shown. To compare the performance of the two conceptual approaches to create 
the NLU training dataset, we created a set of experiments that are described in 
Sect. 3. After evaluating the performance results of the conducted experiments in 
Sect. 4, we bring the paper into the context of related work (Sect. 5). An outlook 
is given in Sect. 6. 


1.1 NLU in Chatbot Context 


In current DS architectures the NLU component is the most critical component 
to the success of chatbots or question answering (Q&A) [6]. It aims to identify 
the meaning behind the user’s input and extracts all the custom entity values in 
the incoming utterance [23]. Identifying the intent of the interlocutor is a classifi- 
cation problem that can be solved using supervised machine learning techniques. 
Available classifiers include Support Vector Machines (SVM) [3,13], deep neural 
networks [18,19] and embedding models [24]. The classifier is trained to predict 
which of the learned intent classes the incoming utterance belongs to and to
assign this label to the utterance so that it can be used by the next component 
[20]. All the intents that the system shall be able to match to user inputs have 
to be included in the training dataset. If the user input does not correspond to 
any of the learned intent labels, the model will still match it to one of them [16]. 
In closed domain DS this behavior leads to a chatbot that will answer every 
question, which must be taken into account during the creation process of the 
training dataset. 
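The closed-set behaviour described above can be illustrated with a toy example. The following sketch is not one of the classifiers cited in this section; it is a deliberately simple bag-of-words scorer (the names `train` and `predict` are hypothetical) that only shows why a classifier without a rejection class always returns one of the learned intents.

```python
from collections import Counter

def train(examples):
    """Count word occurrences per intent (a toy bag-of-words model)."""
    counts = {}
    for text, intent in examples:
        counts.setdefault(intent, Counter()).update(text.lower().split())
    return counts

def predict(counts, text):
    """Return the best-scoring learned intent; there is no 'none' class."""
    words = text.lower().split()
    scores = {intent: sum(c[w] for w in words) for intent, c in counts.items()}
    return max(scores, key=scores.get)

model = train([
    ("where is the lecture taking place", "location_of_lecture"),
    ("which lectures are offered in this semester", "lectures_in_semester"),
])

# An out-of-domain request is still mapped to one of the learned intents.
out_of_domain = predict(model, "what will the weather be like tomorrow")
```

Because `max` always selects some key, even an utterance that shares no meaningful words with the training data receives an intent label.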
The second task of the NLU is to extract custom entities using sequence-
labeling techniques. Conditional Random Fields (CRF) and Recurrent Neural
Networks (RNN) are most commonly used to label each unit in an utterance to
determine the words that correspond to each of the learned entity types [10]. 
This is achieved by extracting features from the surrounding words (context) so 
that the system can predict not only the entity values present in the training data 
but also new values that users might use in their messages. Both components 
form an NLU. Examples of available NLUs include Microsoft’s LUIS,¹ IBM’s
Watson² and RASA’s NLU³ [4].


"text": "Where is the lecture Web Science taking place?",
"intent": "location_of_lecture",
"entities": [
  {
    "start": 21,
    "end": 32,
    "value": "Web Science",
    "entity": "lecture"
  }]


Listing 1.1. Example of a labelled utterance used to train the intent classifier and 
entity extractor of the NLU. 
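The character offsets in Listing 1.1 can be derived mechanically from the utterance text. The following minimal sketch shows one way to do this; the function name `label_utterance` is hypothetical and not taken from the paper, and the sketch assumes that the entity value occurs exactly once in the text.

```python
def label_utterance(text, intent, entity_type, value):
    """Build one labelled data point in the layout of Listing 1.1."""
    start = text.find(value)  # index of the first character of the value
    if start < 0:
        raise ValueError(f"{value!r} does not occur in {text!r}")
    return {
        "text": text,
        "intent": intent,
        "entities": [{
            "start": start,
            "end": start + len(value),  # exclusive end offset
            "value": value,
            "entity": entity_type,
        }],
    }

example = label_utterance(
    "Where is the lecture Web Science taking place?",
    "location_of_lecture", "lecture", "Web Science")
```

With the utterance from Listing 1.1, the computed offsets agree with the ones shown there (start 21, end 32).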


An example of a data point that can be used either for training or for testing the NLU
is presented in Listing 1.1 (see Sect. 2). To generate training data like the
one shown, we create a set of utterances related to each intent and integrate the 
entity values from either a knowledge base or placeholder values at the designated 
places. This approach not only provides a semi-automated way for generating 
training datasets, such as the one depicted in Listing 1.1, it further provides the 
first step towards an integration of Semantic Question Answering (SQA) [21] 
tasks into chatbots. By generating training data as described in this work, the 
aim is to analyze how well a system can be trained if little or no information 
is available about the entity values that users might use in their utterances. In 
summary, we provide contributions to the following questions: 


RQ 1 Which type of entity values work best for training the entity recognition 
algorithm? 

RQ 2 Which type of entity values work best for training different intent classi- 
fiers? 

RQ 3 How can Linked Data improve NLU performance?


1.2 Motivating Example 


We describe an example that motivates our approach and experiments. The 
handbook for the study program Industrial Engineering and Management at 


1 https://www.luis.ai/home, accessed on 11.12.2018. 

2 https://console.bluemix.net/developer/Watson/documentation, accessed on
11.12.2018.

3 https://rasa.com, accessed on 11.12.2018. 
KIT is publicly available as a PDF document. In order to make this information
accessible by a computer program, such as a dialogue system, the relevant data 
were extracted and transformed to RDF. The domain of a DS trained with 
the RDF triplestore is defined by lectures, lecturers, locations and semesters,
where each person (lecturer) can give a lecture from a specific module in a
given room/building (location_of_lecture) at a given date/semester. To answer
a question like ‘Where is the lecture Web Science taking place’ (Q1) the NLU
needs to detect the intent location_of_lecture and the entity lecture with the
value ‘Web Science’. Note that the question about a location is related to
the entities found in the question (in Q1, lecture). This problem is addressed by
relation linking (RL) [7]. Before the RL problem can be resolved, however, it
must first be ensured that the correct intents and entities are found. To train
the NLU, a set of utterances is defined for each intent (see Listing 1.1). In a
closed domain DS the entries from the knowledge base can be used to generate
utterances by replacing the entities (e.g. “Web Science” in Listing 1.1) in
the utterances with the entries from the knowledge base. For example, with the
help of the sentence pattern ‘Where is the lecture {lecture} taking place?’ and the
knowledge base, data points can be generated automatically from the lectures
property. Here, {lecture} is a placeholder for the lecture entries from the triple
store or other values. We call this concept the domain or placeholder concept
(see Sect. 2.1). The results are multiple data points with the same structure but
different entities. Taking into account that the entity values (i.e. for the entity
lecture) can change over time, the NLU has the task of identifying intents and
entities that did not exist before. We address this problem by providing a robust
NLU (as defined in Sect. 1) from the beginning.


2 Construction of Training Data 


In the first part the general design approach is described before presenting a 
holistic approach that can be used to systematically create a DS and its matching 
training dataset. 


2.1 Training Data Design Approaches 


In this work, we aim to optimize the performance of the two tasks of the NLU 
(intent classification and entity recognition) by optimizing the dataset that is
used to train the system. Therefore we created two different design approaches 
that can be applied to create the training dataset for a domain-specific NLU. In 
this work, we focus on how the performance of the trained NLU is impacted if 
different types of entity values are used to create the training dataset. 

Before going into the specifics of the approaches, it should be noted that
the utterance patterns have to be created first. This is necessary so that the entity
values can later be filled in automatically. For each of the defined intents, a set
of utterances has to be created, where each one contains one or more entity
values that the system shall learn to detect. Because we want to be able to insert
different types of values automatically, an empty slot of matching type is inserted
at the position where an entity value shall be inserted during the creation of the
training data. Looking at the utterance from the motivating example, we replaced
the value of type lecture (Web Science) by an empty slot of type lecture (‘Where 
is the lecture {lecture} taking place’). Now we are able to insert different types 
of entity values into the utterances without having to change the utterances 
manually. Both approaches use the same utterances for each intent but are filled 
with different kinds of entity values. 
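The slot mechanism described above can be sketched in a few lines. The helper `fill_slots` below is a hypothetical illustration rather than the authors' implementation; it assumes that slots are written as `{type}` and that one value per entity type is supplied.

```python
import re

SLOT = re.compile(r"\{(\w+)\}")  # matches empty slots such as {lecture}

def fill_slots(pattern, values):
    """Replace each {type} slot with values[type] and record entity spans."""
    text, entities, last = "", [], 0
    for match in SLOT.finditer(pattern):
        text += pattern[last:match.start()]  # copy text before the slot
        entity_type = match.group(1)
        value = values[entity_type]
        entities.append({"start": len(text), "end": len(text) + len(value),
                         "value": value, "entity": entity_type})
        text += value
        last = match.end()
    text += pattern[last:]  # copy the remainder of the pattern
    return text, entities

pattern = "Where is the lecture {lecture} taking place?"
domain_text, domain_entities = fill_slots(pattern, {"lecture": "Web Science"})
placeholder_text, placeholder_entities = fill_slots(pattern, {"lecture": "x"})
```

The same pattern yields a domain data point and a placeholder data point; only the inserted value and the resulting offsets differ.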

The two approaches described in the following are called the Domain Con-
cept and the Placeholder Concept. As the names suggest, we used entity values
from a related knowledge base to create the training dataset in the domain
concept and placeholder values in the placeholder concept to create the dataset
for training the NLU.

In the domain concept, the related knowledge
base is queried for each of the defined entity types in order to extract
all available values and store them in a list. These values are then used to fill
the empty slots in the utterances, with respect to the entity type restriction.
Table 1 shows how both concepts work and further depicts an example for each
of them. The example shows how one of the entity values of type lecture is used
to fill the empty slot of matching type in the example utterance. This utterance
together with the appropriate labels can then be used to train the components
of the NLU.


Table 1. Conceptual approaches used to create the dataset for training and testing 
the NLU of the task-oriented component-based dialogue system. 


Concept | Domain Concept (DM Concept) | Placeholder: Identical Pl. Values (PH Type 1) | Placeholder: Different Pl. Value (PH Type 2)
Entity value generation | Domain entity values | One random character sequence for all entity types | Random character sequence for each entity type
List of entity values (Entity Type: Entity Values) | Type1: [List of values]; Type2: [List of values] | Type1, Type2: Random Value | Type1: Random Value 1; Type2: Random Value 2
Utterance (all concepts) | <beginning of utterance> <entity type x> <...> <entity type y> <end of utterance>
Example | Entity Type: lecture; Entity Values: [Web Science, ...] | Entity Type: lecture; Entity Value: x | Entity Type: lecture, ...; Entity Value: v
Example utterance (all concepts) | Where is the lecture <lecture> taking place?


The second approach is called Placeholder Concept and refers to the fact 
that instead of real values, taken from some knowledge database, placeholder 
values are inserted into the utterances to create the dataset for training the NLU. 
In general, the placeholder values are values which consist of one or multiple 
random words of varying length. The random words used in this work have,
for example, been created by randomly selecting one or multiple letters from the English
alphabet. Within this concept, we followed two different ways of creating the
dataset. The first one is called the Identical Placeholder Values Concept (PH
Type 1). As the name suggests, in this approach only one random value is
created and used to fill all the empty slots in the utterances regardless of the 
entity type. In the example shown in Table 1 the letter x was selected to fill all of 
the empty slots. The second approach is called Different Placeholder Value 
Concept (PH Type 2). In this approach different random values are used to 
fill the different types of empty slots. For each of the defined entity types, one 
unique random value is created and used to fill the corresponding empty slots. 
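The two placeholder variants can be sketched as follows. The function names are hypothetical, the list of entity types is the one used later in the experiments, and the default word length of one letter is an assumption chosen to match the example values in Table 1.

```python
import random
import string

ENTITY_TYPES = ["lecture", "person", "semester", "subject", "module", "building"]

def random_word(rng, length=1):
    """Draw `length` random lowercase letters from the English alphabet."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def ph_type1(entity_types, rng, length=1):
    """PH Type 1: one random value shared by all entity types."""
    value = random_word(rng, length)
    return {entity_type: value for entity_type in entity_types}

def ph_type2(entity_types, rng, length=1):
    """PH Type 2: one unique random value per entity type."""
    values, used = {}, set()
    for entity_type in entity_types:
        word = random_word(rng, length)
        while word in used:  # redraw until the value is unique
            word = random_word(rng, length)
        used.add(word)
        values[entity_type] = word
    return values

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
type1_values = ph_type1(ENTITY_TYPES, rng)
type2_values = ph_type2(ENTITY_TYPES, rng)
```

Either dictionary can then be used to fill the empty slots of matching type in the utterance patterns.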

With the experiments described in Sect. 3 we aim to determine which design
concept is best for training a domain-specific NLU. Based on the design of
the concepts, it can be expected that a dataset containing
all available entity values yields the highest results.

In the next part of the section, we introduce a holistic approach that can be 
used to create the dataset matching the requirements of a domain-specific NLU 
of a task-oriented DS. 


2.2 Training Data Creation Process 


In this subsection, we describe an approach that can be used to design the NLU 
of a task-oriented DS and to create a dataset matching the requirements. The 
complete approach is depicted in Fig. 1 and is based on the procedure described 
by Grötz [8].


[Figure: process diagram with three areas. Domain Specification: Defining the DS Functionalities (e.g. find location of lecture). Customizing the NLU: Creating a List of Intent Labels (location_of_lecture, lectures_in_semester) and the required Entity Types (lecture, semester). Creating the Labelled Datasets: Creating a List of Example Utterances for each Intent (e.g. "Where is the lecture <lecture> taking place?", "Which lectures can I choose to attend in <semester>?"), Selection of Entity Type Values, and Creation of the Labelled Dataset for Training and Testing the NLU.]


Fig. 1. Process for designing a customized NLU and creating the corresponding labeled 
dataset to train and test the system 


The process consists of six processing steps which can be categorized into 
three areas. The first area focuses on defining the functions/tasks of the DS. 
The processes in the second area derive a set of intent and entity type
labels that the NLU needs to be able to assign to an incoming utterance. In the
processes of the last area, the previously defined intents and entity types are 
used to create a matching dataset for training (and testing) the NLU. 

The first area is called Domain Specification and consists of one process 
during which a set of functions/tasks are defined that the dialogue system shall 
be able to handle. According to Grötz [8], it is recommended to start with a 
small set of functions and to use the collected experience over time to improve 
them and to successively add new ones. In this work, we created the NLU of a 
DS which aims to support students at the KIT in acquiring information related 
to their study program (see Sect. 1.2). Two of the defined functions, depicted in
Fig. 1, aim to find the location where a certain lecture takes place and to identify
lectures that take place in a specific semester. In our approach, the information
required to answer the students’ questions is stored within an RDF knowledge
base. SPARQL queries are used to retrieve the information requested in the
incoming question.

The second area is called Customizing the NLU during which a list of 
intent labels and entity type labels has to be defined that the NLU shall be able to 
assign to the incoming utterances. Within the second process step called Creating 
a List of Intent Labels, one intent label is created for each of the previously 
defined functions. Following the example depicted in Fig. 1, one intent label is
created for each of the two functions. The intent label related to the first function 
is called location_of_lecture and the one related to the second function is called 
lectures_in_semester. In the third step, the types of entity values are determined 
which the NLU needs to be able to extract from the incoming utterances. These 
values are required to perform the functions defined in the first process step. The 
types of entity values that the NLU has to be able to extract can e.g. be derived 
from the underlying SPARQL queries. This is essential since the entity values 
are required to perform the query in order to retrieve the demanded information 
from the knowledge base. In the presented example, entity values of type lecture 
or of type semester are required to execute the underlying query. 
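How the required entity types can be read off the underlying queries is sketched below. The query templates, the prefix `ex:` and the property names are purely illustrative assumptions and do not reproduce the vocabulary of the knowledge base used in this work; only the two intent names are taken from the text.

```python
from string import Formatter

# Hypothetical SPARQL templates; literal braces are doubled so that
# str.format only treats {lecture} and {semester} as slots.
QUERY_TEMPLATES = {
    "location_of_lecture":
        'SELECT ?room WHERE {{ ?l ex:title "{lecture}" ; ex:room ?room . }}',
    "lectures_in_semester":
        'SELECT ?l WHERE {{ ?l ex:semester "{semester}" . }}',
}

def required_entity_types(template):
    """Return the slot names that must be filled before the query can run."""
    return sorted({name for _, name, _, _ in Formatter().parse(template) if name})

entity_types = {intent: required_entity_types(template)
                for intent, template in QUERY_TEMPLATES.items()}

# Filling the slot with an extracted entity value yields an executable query.
query = QUERY_TEMPLATES["location_of_lecture"].format(lecture="Web Science")
```

In a real system the extracted value should be escaped or bound as a query parameter instead of being inserted by plain string formatting.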

After having defined the required parameters, the process steps within the 
third area focus on creating an optimal dataset for training the NLU. Within 
the fourth step, a list of utterances is created for each of the defined intents 
following the procedure described in the previous section. At the positions in the 
utterances where an entity value of a certain type shall be inserted, an empty 
slot of matching type is placed. Furthermore, the utterances have to match the 
language usage of the target users (e.g. formal or informal) [8]. In Fig. 1 one 
utterance for each of the two intents is depicted where each includes one of the 
two defined entity types. In the fifth step, a list of entity values for each type is
created that is then used to fill the empty slots in the utterances in order to create
the final dataset. As explained in the previous section, there are two approaches
that can be applied for replacing the empty slots in the utterances. The first
one is depicted in step 5.1, where a list of ‘real’ entity values is extracted from a
related knowledge base. As described in Sect. 1.2, we created an RDF knowledge
graph that contains all information related to the Industrial Engineering and
Management study program at KIT. The second option is to use placeholder
values instead of real values. One value is assigned to each of the entity types;
these values can be either identical or different, as shown in Table 1.

In the last process step the empty slots in the utterances from step 4 are
replaced using one of the lists created in step 5. Finally, information about the
two sets of labels is added to each utterance. This includes the intent label, the
entity type, the entity value and the position at which the entity value can be
found in the utterance. This information is stored in a format such as
JSON.


3 Experiments 


Based on the previously introduced approach we created a task-oriented NLU 
to determine which of the approaches from Subsect. 2.1 is best for training such 
a system. In the first part, we describe the development of the training datasets 
which were used to train the NLU, which we then evaluated to compare the 
performance that can be achieved by following the different design approaches. 
The applied pipeline of the NLU is described as part of the state of the art 
within the context of related work (see Sect. 5).


3.1 Creation of Domain Specific Dataset 


In order to evaluate the different approaches previously described, we created 
several datasets to train the NLU of the DS introduced in Sect. 1.2. Following 
the process from Sect. 2.2 we first defined the functionalities of our DS and used
them to derive a set of intents and entity types for creating the NLU. Next, we
created a set of utterances with empty slots for each intent and created three
entity type lists with different values to fill the empty slots. In the last part of
the section we describe the experimental datasets used to evaluate the design
approaches from Sect. 2.1.


System Specification and Creation of Utterances. Following the process 
described in Fig. 1 we defined 16 functions that our DS shall be able to perform. 
For the configuration of the NLU, we created one intent label per function, 
which the intent classifier shall be able to assign to incoming utterances after 
training. In addition, we derived the types of entity values that are required to 
perform the succeeding processing step, such as making a database inquiry (not 
realized in this work). In total, the NER component of the NLU needs to be able 
to recognize and extract six different types of entity values. An extract of the 
complete list of the intents and the corresponding entity values can be seen in 
Table 2. The first column shows the name of the intent and the last column the 
entity value type that is required for further processing. 

Furthermore, the table shows how many utterances have been manually cre- 
ated for each intent. As described in Subsect. 2.1 we inserted empty slots at the 
position in the utterance where one of the entity values shall be included in 
the final step. In total, 299 utterances were created, which were split into a training
(80%) and a test (20%) set. These utterances are used by both design approaches
to create the final datasets for training and testing the NLU. 
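A per-intent split of this kind can be sketched as follows. The function name, the shuffling and the seed are assumptions; the paper does not state how the split was drawn or how fractions were rounded. Rounding the training share down happens to reproduce the per-intent counts listed in Table 2 (for example 23 utterances split into 18 and 5).

```python
import random

def split_utterances(utterances, train_share_percent=80, seed=0):
    """Shuffle the utterances of one intent and split them into train/test."""
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic rounds the training share down.
    n_train = len(shuffled) * train_share_percent // 100
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical stand-ins for the 23 utterances of one intent.
utterances = [f"utterance {i}" for i in range(23)]
train_set, test_set = split_utterances(utterances)
```

Splitting per intent rather than over the whole pool keeps every intent represented in both sets.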


Table 2. Number of training utterances created for each intent and their corresponding 
entity types. 


Intent | Utterances: Train (80%) | Utterances: Test (20%) | Utterances: Combined | Entity type
lecturer_of_lecture | 18 | 5 | 23 | lecture
lectures_in_the_current_semester | 8 | 3 | 11 | semester, subject
semester_of_lecture | 8 | 2 | 10 | lecture, semester
subject_of_lecture | 8 | 2 | 10 | lecture, subject
modules_within_subject | 16 | 4 | 20 | subject
subject_affiliation_of_module | 16 | 5 | 21 | module
location_of_room | 19 | 5 | 24 | building
office_of_lecturer | 16 | 4 | 20 | person
Total | 234 | 65 | 299 |

Entity Values. As explained in Sect. 2.1 there are two options to replace the 
empty slots with a corresponding entity value. Following the domain concept, 
we extracted all values related to each of the six entity types from a related RDF 
file as explained in the motivating example. The values were retrieved by using
one SPARQL query for each type and were then stored in a list. One list was
created for each entity type where all matching values were stored. As with the
utterances, each entity list was split into a training and testing set. The combined
set included all values found. Table 3 depicts the number of values found in the
RDF file which relate to one of the six entity types. The second column in the
table indicates how many empty slots in the utterances exist, which need to be 
filled in order to create the final dataset. Having extracted all possible values 
that the system needs to be able to recognize, the empty slots were replaced by 
looping through the created entity list and filling in a value of the matching type 
into the existing utterances. If there were more empty slots than unique entity
values, some randomly selected values were used more than once. If
there were more unique entity values than empty slots, some utterances were
used more than once. In that case, we randomly selected a matching number
of utterances from the list of utterances that only have an empty slot of that
specific type. Those were then filled with the remaining values to finish
the replacement process.
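The balancing rule described above can be sketched for one entity type. The function `pair_slots_with_values` is a hypothetical simplification: it assumes that every utterance passed in contains exactly one empty slot of the given type, whereas the real utterances may contain several slots.

```python
import random

def pair_slots_with_values(utterances, values, rng):
    """Pair single-slot utterances of one entity type with values of that type.

    Every utterance and every value is used at least once; the shorter
    list is padded with randomly chosen repeats.
    """
    utterances, values = list(utterances), list(values)
    if len(utterances) > len(values):
        # More empty slots than values: reuse randomly selected values.
        values += rng.choices(values, k=len(utterances) - len(values))
    elif len(values) > len(utterances):
        # More values than empty slots: reuse randomly selected utterances.
        utterances += rng.choices(utterances, k=len(values) - len(utterances))
    rng.shuffle(values)
    return list(zip(utterances, values))

rng = random.Random(0)
more_slots = pair_slots_with_values(["u1", "u2", "u3"], ["Web Science"], rng)
more_values = pair_slots_with_values(["u1"], ["a", "b", "c"], rng)
```

Each resulting pair can then be turned into a labelled data point by inserting the value into the slot of the utterance.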

To create the utterances following the placeholder concept we created two 
sets of placeholder values. The type 1 list consists of one value which is used to
replace all empty slots in the utterances independent of the type. The type 2
list contains one unique value for each entity type, which is then used to replace
the empty slots of matching type. The values we used to create our datasets
are depicted in the last two columns of Table 3. In the last step, the previously
created lists with entity value(s) can now be used to create the datasets for
training and testing the different NLUs.

Table 3. Placeholder values and unique domain values used to replace the empty slots.


Entity type | Empty slots | Domain values: Train (80%) | Domain values: Test (20%) | Domain values: Combined | Placeholder: Type 1 | Placeholder: Type 2
lecture | 137 | 69 | 18 | 87 | x | v
person | 46 | 36 | 10 | 46 | x | p
semester | 21 | 11 | 3 | 14 | x | s
subject | 41 | 4 | 1 | 5 | x | f
module | 61 | 44 | 11 | 55 | x | m
building | 24 | 19 | 5 | 24 | x | g
Total | 330 | 183 | 48 | 231 | 1 | 6


Experimental Datasets. In order to answer the research questions introduced 
in Sect. 1.1 we conducted a total of five experiments. Thereby we want to determine
which type of entity values is best suited to create the training data and
how the trained NLU performs on different test datasets.

The datasets for training the NLU have been created by filling the designated 
training utterances with some related entity values, as described in Subsect. 2.2. 
Table 4 contains an overview of the experiments and the datasets used to evaluate 
the performance of the NLU. The first two experiments are related to the domain 
concept. In the first experiment (EX 1) the training dataset contains a subset 
of the entity values that have been extracted from the available knowledge base. 
Thereby we want to analyze how well the NLU can perform the two tasks if 
the test set contains unknown utterances and unknown values taken from the 
knowledge base. In addition, we want to determine how well the NLU performs 
if the utterances are filled with entity values taken from another domain, in this 
case, the DBpedia knowledge graph. To determine how well the NLU performs 
if all domain related entity values are used for training, we conducted the second 
experiment (EX 2). 

The third and fourth experiments (EX 3 and 4) have been created to evaluate 
how the performance of the NLU changes if placeholder values are used to train 
the system. In EX 3 the train utterances have been filled with the PH Type 1 
values and in EX 4 they have been filled with PH Type 2 values. 

In the last experiment, we filled the train utterances with the values extracted
from DBpedia and merged the result with the EX 1 dataset. Thereby we aim
to determine if the performance can be improved when the dataset is enriched
with values taken from another domain. Because we were not able to extract
entity values for all six types, we only used the utterances that contain
at least one entity of the following types: lecture, building or person.

The datasets used to test the performance have been created by using either
the test set of the domain values or the test set of the DBpedia values to fill 
the test utterances. Because the DBpedia set does not contain values of type 
semester, subject and module the domain values of those types have been used 
to create the Test DBpedia dataset. For determining and evaluating the perfor- 
mance of the different conceptual approaches we calculated the precision, recall 
and F1-score of the trained NLUs. No cross-validation has been applied to eval- 
uate the performance. 
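For reference, the three metrics can be computed as in the following sketch. It scores exact matches of predicted entity spans; whether the reported values were aggregated per entity, per token or per class is not stated in the text, so the matching criterion used here is an assumption.

```python
def precision_recall_f1(gold, predicted):
    """Score predictions against gold labels.

    Both arguments are sets of (utterance_id, start, end, entity_type).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted entity spans for four test utterances.
gold = {(0, 21, 32, "lecture"), (1, 10, 20, "person"),
        (2, 5, 12, "semester"), (3, 0, 8, "building")}
predicted = {(0, 21, 32, "lecture"), (1, 10, 20, "person")}
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

In this constructed case every prediction is correct but half of the gold entities are missed, which is the pattern of high precision and low recall that a cautious extractor produces.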
Table 4. Utterances and entity values used to create the experimental datasets.


Training datasets 


EX 1 Train Utterances + Train Domain Entity Values
EX 2 Train Utterances + All Domain Entity Values 
EX 3 Train Utterances + PH Type 1 Entity Values 
EX 4 Train Utterances + PH Type 2 Entity Values 
EX 5 EX 1 extended by Train Utterances + Train DBpedia Entity Values


Testing datasets 


Domain test | Test Utterances + Test Domain Entity Values 
DBpedia test | Test Utterances + Test DBpedia Entity Values 


4 Evaluation Results 


In this section, the results of the different experiments are evaluated. Table 5
provides an overview of the performance values that have been used to measure 
the performance of the NER and the intent classifier of the NLU. In the first 
part of the section, we analyze the results of the NER and intent classifier before 
giving a recommendation about which approach to use for training the two 
components of the NLU. 


4.1 Performance NER 


The first part of Table 5 shows the results when using the Domain Test dataset
for evaluating the performance of the differently trained NLUs and the second
part shows the results when using the DBpedia test dataset for testing. The first
part of the table clearly shows that the datasets related to EX 1, 2 and 5 lead
to the best NER performances. From those, it can be derived that using more
unique entity values leads to better results. If all potential entity values that an
NLU shall be able to extract are known in advance, it is best to use them all
for training. Enlarging the training dataset with utterances that are filled with
values from another domain does not lead to better results. When the
DBpedia test dataset is used for evaluation, the results clearly show that the F1-score of
EX 5 is highest and that EX 5 is therefore most suited for training. The results related to EX
1 and 2 are in this case far lower: the discrepancy between EX 1 and
2 and EX 5 is between 11.7 and 15.6 percentage points. In the previous test, the results
were much closer, with a discrepancy between 3.2 and 6.1 percentage points. In both cases,
training the NER with placeholder values leads to the lowest results. Although
using PH type 1 values leads to slightly higher results, the performance is still
much lower than that of the other approaches. Because their results are
more than 50% lower than those of the other approaches, placeholder values are not suited for
training the NER component of the NLU.
Based on these results we recommend using the approach related to EX
1 or 2 for training the NER component if the NLU shall be optimized for a
certain domain. If instead, the NLU shall perform well on several domains we
recommend merging the datasets following the approach described in EX 5 to
maximize the NER’s performance. 
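The discrepancies quoted for the DBpedia test set follow directly from the NER F1-scores in Table 5. The short sketch below recomputes them; the helper name is hypothetical and the scores are copied from the table.

```python
# NER F1-scores on the DBpedia test set, copied from Table 5.
ner_f1_dbpedia = {"EX 1": 0.812, "EX 2": 0.851, "EX 5": 0.968}

def gap_in_percentage_points(a, b):
    """Difference between two scores in [0, 1], in percentage points."""
    return round((a - b) * 100, 1)

gap_to_ex2 = gap_in_percentage_points(ner_f1_dbpedia["EX 5"], ner_f1_dbpedia["EX 2"])
gap_to_ex1 = gap_in_percentage_points(ner_f1_dbpedia["EX 5"], ner_f1_dbpedia["EX 1"])
```

The two gaps correspond to the lower and upper bound of the range stated for the DBpedia test.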


Table 5. Performance results of the conducted experiments. 


Experiment | NER Precision | NER Recall | NER F1-Score | Embedding classifier Precision | Embedding classifier Recall | Embedding classifier F1-Score
Domain test
EX 1 | 0.995 | 0.895 | 0.940 | 0.8626 | 0.8458 | 0.8452
EX 2 | 1.000 | 0.967 | 0.979 | 0.8497 | 0.8151 | 0.8101
EX 3 | 0.893 | 0.286 | 0.428 | 0.8663 | 0.8308 | 0.8288
EX 4 | 0.928 | 0.220 | 0.351 | 0.8311 | 0.8151 | 0.8029
EX 5 | 0.991 | 0.856 | 0.918 | 0.8657 | 0.8308 | 0.8269
DBpedia test
EX 1 | 0.878 | 0.770 | 0.812 | 0.8448 | 0.8151 | 0.8103
EX 2 | 1.000 | 0.806 | 0.851 | 0.8522 | 0.8151 | 0.8170
EX 3 | 0.835 | 0.312 | 0.451 | 0.8907 | 0.8769 | 0.8722
EX 4 | 0.848 | 0.220 | 0.351 | 0.8317 | 0.8151 | 0.8024
EX 5 | 0.992 | 0.950 | 0.968 | 0.8479 | 0.8308 | 0.8185


4.2 Performance Intent Classifier 


The performance results of the different experiments when using the domain
test for the evaluation show that overall all approaches perform well,
with F1-scores greater than 80%. By comparing EX 1 and 2 it can be noticed
that when more unique entity values are used for training the performance of
the classifier decreases. We assume this to be the case because several entity
values are used multiple times within different utterances that belong to different
intents. Because the classifier learns which words relate to which intent, we
assume that this approach causes a distortion of the vector space which results
in lower performance results. Therefore EX 1 performs better than EX 2 and is
overall the best approach for training a domain-specific intent classifier. It should
be noted that when using the other dataset for testing, the results of EX 2
are slightly higher than those of EX 1, which could indicate that there are other
factors that have a significant impact on the performance.

Looking at the experiments that applied the placeholder concept, the results
show that this approach is well suited for training a high-performance intent
classifier. In particular, when using PH type 1 values, the discrepancy in
F1-score between EX 1 and EX 3 is only 1.64 percentage points. Furthermore, it
is possible to train a much more robust classifier using PH type 1 values. As
the results on the DBpedia test dataset show, the classifier trained using PH
type 1 values performs better than all other trained classifiers. This
approach is therefore better suited for training a classifier that should
perform well in several domains. It increases the robustness of the NLU, which
performs best when entity values from the DBpedia domain are used in the test
utterances.

Based on the results at hand, we recommend applying the domain approach
following the EX 1 construction when training an intent classifier that only
needs to perform well in a certain domain. When aiming to train a more robust,
open-domain intent classifier, we recommend using PH type 1 values to
construct the training dataset. Although the performance in some domains might
be lower compared to using domain-specific values for training, the
performance over all domains will be higher.

In order to optimize the performance of the placeholder concept, differently 
designed placeholder values can be tested. We created values of different word 
length and also created values which consisted of two or more random words. 
Although we were not able to increase the performance, it might be possible to 
find values that can be used to increase the performance. 


5 Related Work 


Our contribution in training an NLU targets the research field of chatbots as
well as SQA. While most chatbot frameworks (IBM Watson, Microsoft Bot Service)
are based on deep learning technologies for intent and entity recognition as
one NLU component, most SQA systems use a static n-gram strategy [22] or
entity linking tools [5]. The DBpedia Bot [1] is one example of a rule-based,
static SQA realization. This static approach to Q&A over knowledge graphs
(KGs) has the disadvantage of only being able to react conditionally to
sentence conversions. The idea of the Frankenstein framework [22] is to link
these static approaches by generalizing SQA into three steps (Named Entity
Recognition and Disambiguation, Relation Linking, and Query Building). With
respect to the SQA task, our work addresses the NER and NED components, whereby
an intent classification task is also taken into account and could improve the
query building component. In general, it is possible to train multiple closed
domain systems, which would make the NLU applicable in multiple domains [17].
For the present study, the closed domain knowledge is stored in a database and
used to create the training data for the NLU. The database contains all entity
values that users might use in their utterances.

Bapat et al. [2] already presented an end-to-end pipeline for simplifying the
NLU training process, where the first sentences are defined and extended for
the following training. While they skip the extension of the training dataset
and only classify possible extension methods into five categories, our
approach mainly targets the class of generating big pools of parameter values.
The subsequent NLU training was conducted using the state-of-the-art,
open-source software of the Berlin-based company Rasa [16]. The extraction of
entities and the classification of intents can be regarded as two separate
tasks that can be achieved by two different pipelines that are merged into one
coherent NLU pipeline. The intent classification pipeline uses the tokenized
utterances created by the spaCy model [9]. During training, each token and
intent label is represented as a feature vector, except for digits, all of
which are assigned to the same feature vector. The embeddings model is based
on the StarSpace model developed by Facebook [24]. During training, the
embeddings classifier learns its own embeddings for each of the words in the
training dataset, thereby taking into account domain-specific uses of words
[15]. The created feature vectors are enriched by an additional three
dimensions using the intent_featurizer_ngrams. The three most common n-grams
in the training data are determined, and the three added dimensions are used
to indicate whether a given token includes one of these n-grams. The NER
pipeline tokenizes the incoming utterance into its elements, also using the
spaCy model, and automatically assigns POS tags to each word in the utterance.
Since only CRF [12] is supported as a NER algorithm in Rasa, it was applied for
the experiments. Placeholder concepts could be considered as a way to increase
the number of training examples and thus improve the NLU performance.


6 Conclusion and Outlook 


Three different design approaches for creating labeled training datasets were
developed and integrated into a holistic development process to design the NLU
of a task-oriented DS and to create a corresponding dataset for training the
component. While the experiments for RQ 1 clearly show that using more unique
entities improves the performance of the NER component, a placeholder concept
affects the intent classifier (RQ 2) only slightly. In terms of robustness,
the evaluation of EX 5 on different test datasets shows that the performance
of the NER component can be increased by including training datasets from
different domains. With placeholder values from different domains, we show how
Linked Data can help to increase (RQ 3) not only NLU robustness but also
overall performance in open domains.

A challenge that appears with RQ 1-2 is the generalizability of the proposed
concepts. We mainly address small, domain-specific databases, whereby an
evaluation on larger datasets with multiple domains could lead to synergy
effects within the creation of the NLU training dataset. For further research,
the NLU component could be integrated into the Frankenstein framework and
evaluated on the SQA challenge dataset [14].
Abstract. Data from relational web tables can be used to augment
cross-domain knowledge bases like DBpedia, Wikidata, or the Google 
Knowledge Graph with descriptions of entities that are not yet part 
of the knowledge base. Such long-tail entities can include for instance 
small villages, niche songs, or athletes that play in lower-level leagues. 
In previous work, we have presented an approach to successfully assem- 
ble descriptions of long-tail entities from relational HTML tables using 
supervised matching methods and manually labeled training data in the 
form of positive and negative entity matches. Manually labeling training 
data is a laborious task given knowledge bases covering many different 
classes. In this work, we investigate reducing the labeling effort for the 
task of long-tail entity extraction by using weak supervision. We present a 
bootstrapping approach that requires domain experts to provide a small 
set of simple, class-specific matching rules, instead of requiring them to 
label a large set of entity matches, thereby reducing the human super- 
vision effort considerably. We evaluate this weak supervision approach 
and find that it performs only slightly worse compared to methods that 
rely on large sets of manually labeled entity matches. 


1 Introduction 


Cross-domain knowledge bases like YAGO [8], DBpedia [9], Wikidata [20], or 
the Google Knowledge Graph are being employed for an increasing range of 
applications, including natural language processing, web search, and question 
answering. However, the entity coverage of knowledge bases is far from
complete [4, 16]. YAGO and DBpedia, for example, rely on data extracted from
Wikipedia and as a result cover mostly head instances that fulfill the
Wikipedia notability criteria [12]. As the utility of a knowledge base for
many tasks increases with its completeness, adding long-tail entities to a
knowledge base is an important task.

Web tables [3], which are relational HTML tables extracted from the Web, 
contain large amounts of structured information, covering a wide range of topics. 
In previous work [12], we proposed a method for extracting long-tail entities and 
showed that web tables are a promising source for augmenting knowledge bases 
© The Author(s) 2019 
with new and formerly unknown entities. For this, we trained models using large
sets of manually labeled class-specific entity matches. Given that knowledge 
bases can have many classes, manual labeling limits the usefulness of automatic 
knowledge base augmentation from web tables. 

Weak supervision approaches aim at reducing labeling effort by using super- 
vision that is more abstract or noisier compared to traditional manually labeled 
high-quality training examples (strong supervision) [14]. Data programming [15] 
is a paradigm, where experts are tasked with codifying any form of weak supervi- 
sion into labeling functions. These functions are then employed within a broader 
system to generate training data by assigning labels and confidence scores to 
unlabeled data. Recently, various systems based on the data programming
paradigm have been suggested [1, 14, 19].

For many types of entities, humans generally possess knowledge about when 
entities definitely match, and what are strong signals that entities do not match. 
Writing down this general knowledge in the form of simple bold matching rules 
requires far less effort than labeling many individual positive and negative entity 
matches. Building on this observation and the data programming paradigm, this 
paper investigates for the task of long-tail entity extraction whether strong super- 
vision in the form of positive and negative entity matches can be replaced by a 
set of simple bold matching rules. In order to make it easy to write down such 
rules, we restrict the rule format to conjuncts of equality tests. These tests are 
expressed using the schema of the knowledge base without requiring experts to 
assign weights or specify similarity metrics. Additionally, we introduce a boot- 
strapping method that exploits the matching rule sets to generate training data 
and train a supervised machine learning algorithm. Using these approaches, we 
are able to significantly reduce supervision effort compared to manually labeling 
positive and negative entity matches, while achieving a comparable performance. 

Our contributions are (1) a weak supervision approach that substitutes man- 
ually labeled training pairs by a set of bold matching rules, (2) a bootstrapping 
approach which uses weak supervision to generate training data for a super- 
vised matching method, and (3) an evaluation that compares strong and weak 
supervision for the task of long-tail entity extraction. 

The remainder of this paper is structured as follows. Section 2 describes our
long-tail entity extraction method, including the experimental setup and a
summary of results when using strong supervision. Section 3 describes our weak
supervision methodology, while Sects. 4 and 5 present and discuss our
experiments. Section 6 compares our approach to related work. The results
presented in this paper are fully reproducible, as we publicly provide all
code and datasets.¹


2 Long-Tail Entity Extraction 


In previous work, we proposed and evaluated a method for long-tail entity 
extraction from web tables [12]. This section summarizes the proposed approach, 
describes our experimental setup, and presents results achieved using manually 
labeled training data. 


1 http://data.dws.informatik.uni-mannheim.de/expansion/LTEE/.
2.1 Methodology


Extracting long-tail entities from web tables for knowledge base augmentation 
is a non-trivial task. It consists of two subtasks: (1) identifying entities that are 
not yet part of the knowledge base and (2) compiling descriptions for those new 
entities from web table data according to the schema of the knowledge base. 


[Figure 1: web tables and the knowledge base feed a pipeline of Schema
Matching, Row Clustering, Entity Creation, and New Detection, which outputs
new entities added to the knowledge base.]

Fig. 1. Pipeline for extending a knowledge base with long-tail entities from web tables. 


Long-Tail Entity Extraction Pipeline. Figure 1 gives an overview of our
suggested approach. It is a pipeline that starts with web tables and ends by 
adding new entities to a cross-domain knowledge base. We first cluster all rows 
that describe the same real-world instance together. From these clusters we then 
create entities by compiling descriptions from web table data. Finally, the new 
detection component determines which entities are new, given a specific target 
knowledge base. As a result, we are able to perform the two subtasks of identi- 
fying new entities and compiling their descriptions. 


Schema Matching. The first component of the pipeline is schema matching. 
It creates a mapping between web tables and the knowledge base schema. This 
includes matching web tables to classes and web table columns to properties.
The latter, termed attribute-to-property correspondences [17], allow us to
semantically understand cell values. They are exploited by the entity creation
component to compile descriptions according to the schema of the knowledge
base and, as similarity features, by both the row clustering and new detection
components.


Performing Row Clustering and New Detection. For both row clustering and new
detection, we train random forest classifiers that perform entity matching.
For row clustering, the classifier compares a row pair to determine if the two
rows describe the same entity, while for new detection this is done for a pair
of a created entity and a candidate instance from the knowledge base.

Comparing all possible row pairs or entity-instance-pairs would not scale. We 
therefore utilize a label-based blocking approach using a Lucene index to find 
candidates to be compared. 
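
The idea behind label-based blocking can be illustrated with a small sketch.
It does not use Lucene; a plain in-memory inverted index over label tokens
stands in for it, and the function names and the whitespace tokenization are
assumptions made purely for illustration.

```python
from collections import defaultdict


def tokenize(label):
    """Lowercase a label and split it into a set of tokens (assumed scheme)."""
    return set(label.lower().split())


def build_index(labels):
    """Map each label token to the ids of all items whose label contains it."""
    index = defaultdict(set)
    for item_id, label in labels.items():
        for token in tokenize(label):
            index[token].add(item_id)
    return index


def candidates(query_label, index, exclude=None):
    """Return ids sharing at least one label token with the query label."""
    found = set()
    for token in tokenize(query_label):
        found |= index.get(token, set())
    found.discard(exclude)
    return found


rows = {"r1": "Springfield", "r2": "springfield illinois", "r3": "Shelbyville"}
index = build_index(rows)
# Only r2 shares a label token with r1, so only the pair (r1, r2) is compared.
print(candidates(rows["r1"], index, exclude="r1"))
```

Only pairs returned by the blocker are passed to the classifier, which avoids
the quadratic number of comparisons mentioned above.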

Each matching decision is also given a confidence score. For row clustering, 
we use the confidence scores to perform correlation clustering and generate the 
row clusters. For new detection, we return an entity as new, only if all candidate 
instances from the knowledge base were classified as clear non-matches. 
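
The new detection decision can be sketched as follows. What counts as a
"clear" non-match is not specified here, so the sketch assumes a hypothetical
confidence threshold `min_confidence`; treating an entity without any
candidates as new is likewise an assumption.

```python
def is_new_entity(candidate_decisions, min_confidence=0.0):
    """Decide whether a created entity is new.

    candidate_decisions holds one (is_match, confidence) tuple per candidate
    instance from the knowledge base. The entity is returned as new only if
    every candidate is a non-match with confidence above min_confidence
    (hypothetical parameter).
    """
    return all(
        (not is_match) and confidence > min_confidence
        for is_match, confidence in candidate_decisions
    )


print(is_new_entity([(False, 0.4), (False, 0.3)], min_confidence=0.1))
print(is_new_entity([(False, 0.4), (True, 0.2)], min_confidence=0.1))
```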
Table 1. Overview of the number of labels in the T4LTE gold standard.


Label type                | GF-Player | Song | Settlement | Sum
Row pair                  | 1,298     | 231  | 2,768      | 4,297
Entity-instance-pair      | 80        | 34   | 51         | 165
New entity classification | 17        | 63   | 23         | 103
Sum                       | 1,395     | 328  | 2,842      | 4,565


Similarity Features. To train a classifier, we exploit various features, which
are described in more detail in our previous work [12]. Among the features are
first the similarities of labels (LABEL) and bag-of-words vectors (BOW). Secondly, 
using the attribute-to-property correspondences we derive values according to 
the knowledge base schema, which we compare using data-type-specific similarity 
functions (ATTRIBUTE). Using the knowledge base we also derive for each table 
implicit attributes about the entities described in the table, giving us another set 
of values by knowledge base property that we compare using data-type-specific 
similarity functions (IMPLICIT_ATT). For row clustering, we additionally exploit 
the PHI correlation of row labels (PHI) and penalize rows which occur in the 
same table (SAME_TABLE). For new detection, we additionally exploit type overlap 
between a created entity and a candidate knowledge base instance (TYPE), and 
the popularity of a candidate knowledge base instance (POPULARITY). 

For each row pair or entity-instance-pair most features return a single nor- 
malized similarity score. For ATTRIBUTE and IMPLICIT_ATT, we return for a pair 
two scores for each property from the knowledge base schema. One score mea- 
sures the confidence of the pair having equal values given that property, the 
other of the pair having unequal values. 


2.2 Experimental Setup and Results 


We employ the 2014 release of DBpedia [9] as the target knowledge base and 
evaluate our methods on the task of extending the DBpedia classes
GridironFootballPlayer (GF-Player), Song², and Settlement with additional
entities. To
ensure diversity among the classes, we selected each from a different first-level 
class, i.e. Agent, Work, and Place. 

We utilize the English-language relational tables set of the Web Data Commons
2012 Web Table Corpus.³ The set consists of 91.8 million tables. For every
table we assume that there is one column that contains the labels of the instances 
described by the rows. The remaining columns contain values, which potentially 
can be matched to properties in the knowledge base schema. 

For training and evaluation we built the Web Tables for Long-Tail Entity
Extraction⁴ (T4LTE) gold standard. Table 1 provides an overview of the number
of labels in T4LTE. Creating this dataset was rather laborious, as we labeled
4,297 matching row pairs, 165 entity-instance-pairs and 103 new entity
classifications.


2 The class Song also includes all instances of the class Single.
3 http://webdatacommons.org/webtables/#toc3.
4 http://webdatacommons.org/T4LTE/.


[Figure 2: 1000 random unlabeled tables are labeled by a labeling function,
which is either the unsupervised class-agnostic matching rule or the ensembled
weakly supervised classifier built from that rule and the set of class-specific
rules (weak supervision); the output is generated training data for the random
forest learning algorithm.]

Fig. 2. Our overall methodology of introducing weak supervision using class-specific 
rule sets and bootstrapping a supervised learning algorithm using a labeling function. 


When evaluating the pipeline on the T4LTE gold standard using
cross-validation, we achieved an F1 score of 0.80 in the task of finding new
entities. When running the pipeline on the whole web table corpus, we were
able to add 14 thousand new gridiron football players and 187 thousand new
songs to DBpedia, an increase of 67% and 356%, respectively [12].


3 Methodology 


This section describes our approaches for the task of reducing labeling effort 
using weak supervision. The overall methodology is illustrated in Fig. 2. 

We first introduce as a baseline two unsupervised class-agnostic matching 
rules for row clustering and new detection. These rules exploit the similarity 
features described above and aggregate them using a weighted average. 

We then introduce an approach that exploits user-provided class-specific rule 
sets as weak supervision. These rules have a high accuracy, but low coverage, 
which is why we ensemble them with the unsupervised matching rule to derive 
weakly supervised classifiers for both row clustering and new detection. 

Both the unsupervised matching rules and the weakly supervised classifiers
can be used in our pipeline directly. We additionally introduce an approach that
exploits these methods as labeling functions to bootstrap a supervised learning 
algorithm. This is done by using a set of unlabeled web tables to label training 
pairs for both row clustering and new detection. The labeled data is then used 
to train random forest classifiers to be used in our pipeline. 


3.1 Unsupervised Class-Agnostic Matching Rule 


We suggest two unsupervised matching rules that aggregate using a weighted 
average the individual scores generated by the features described in Sect. 2.1. 
To be used in a rule, all features must produce scores that are normalized
and class-agnostic. This already applies to all features except ATTRIBUTE and 
IMPLICIT_ATT, where, given a pair, we normalize by averaging the individual 
property scores, giving us one normalized class-agnostic score per feature. 

We determine the weights of the rules by assigning, based on our own
experience with the metrics, importance factors from 4 to 1 to the individual
features. The weight of a feature is equal to its assigned factor normalized
by the sum of all factors. For the row clustering rule we assign a factor of 4
to LABEL, 2 to BOW and ATTRIBUTE, and 1 to PHI, IMPLICIT_ATT and SAME_TABLE.
For new detection we assign a factor of 4 to LABEL, 3 to BOW and ATTRIBUTE, 2
to TYPE and IMPLICIT_ATT, and 1 to POPULARITY.

The rules determine whether a pair matches or not using a fixed threshold, 
simply set at 0.5 for both rules. The absolute distance of a computed average 
from the threshold determines the confidence of a matching decision. 
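
The unsupervised rule can be sketched in a few lines. The importance factors
and the threshold of 0.5 are taken from the text; the function name, the
dictionary layout, the treatment of an average exactly at the threshold as a
match, and the assumption that every feature score is present are ours.

```python
# Importance factors as stated in the text.
ROW_CLUSTERING_FACTORS = {
    "LABEL": 4, "BOW": 2, "ATTRIBUTE": 2,
    "PHI": 1, "IMPLICIT_ATT": 1, "SAME_TABLE": 1,
}
NEW_DETECTION_FACTORS = {
    "LABEL": 4, "BOW": 3, "ATTRIBUTE": 3,
    "TYPE": 2, "IMPLICIT_ATT": 2, "POPULARITY": 1,
}
THRESHOLD = 0.5


def unsupervised_rule(scores, factors, threshold=THRESHOLD):
    """Aggregate normalized feature scores with a weighted average.

    Returns (is_match, confidence, average). Each weight is the factor
    divided by the sum of all factors; the confidence is the absolute
    distance of the average from the threshold.
    """
    total = sum(factors.values())
    average = sum(factors[name] / total * scores[name] for name in factors)
    is_match = average >= threshold  # tie handling is an assumption
    confidence = abs(average - threshold)
    return is_match, confidence, average


# A pair whose labels match perfectly but that agrees on nothing else.
scores = {name: 0.0 for name in ROW_CLUSTERING_FACTORS}
scores["LABEL"] = 1.0
print(unsupervised_rule(scores, ROW_CLUSTERING_FACTORS))
```

With these factors, a perfect label match alone yields an average of 4/11 and
is therefore not sufficient for a match in row clustering.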


3.2 Class-Specific User-Provided Matching Rules 


Humans often possess general knowledge about which conditions need to be 
fulfilled for entities of a certain domain to clearly match or clearly not match. 
Based on this observation, we suggest as weak supervision a set of user-provided 
bold class-specific rules that classify a given candidate pair as a match or
non-match. They can codify obvious knowledge, e.g. that a settlement cannot be
in two different countries, or non-obvious knowledge, e.g. that only one
football athlete can be drafted in the same year with the same pick number.

The rules consist of conjunctions of attribute tests, expressed using the
schema of the knowledge base. It is only required that the provided rules be
accurate, regardless of their coverage. This makes it a simple task to identify
suitable rules and is the reason why we term these rules bold. For our
experiments, we created four rules per class. For GF-Player we came up with two
matching and two non-matching rules:


(draftYear = Equal) ∧ (draftPick = Equal) → Match (1)
(LABEL = Equal) ∧ (birthDate = Equal) → Match (2)
(draftYear = Unequal) → Non-Match (3)
(draftPick = Unequal) → Non-Match (4)


For Song we also came up with two matching and two non-matching rules: 


(LABEL = Equal) ∧ (artist = Equal) ∧ (releaseDate = Equal) → Match (5)
(LABEL = Equal) ∧ (artist = Equal) ∧ (album = Equal) → Match (6)
(artist = Unequal) → Non-Match (7)
(releaseYear = Unequal) → Non-Match (8)
Finally, for Settlement we have three matching and one non-matching rule:


(country = Equal) ∧ (postalCode = Equal) → Match (9)
(LABEL = Equal) ∧ (isPartOf = Equal) → Match (10)
(LABEL = Equal) ∧ (postalCode = Equal) → Match (11)
(country = Unequal) → Non-Match (12)


The effort spent creating these rules is minuscule compared to manually
labeling the correspondences in the gold standard. While we created only four
rules for each class, we test whether they can substitute 1,395, 328, and 2,842
labels for the classes GF-Player, Song, and Settlement, respectively.

To apply a rule, we exploit the equal and unequal scores generated by the
ATTRIBUTE and IMPLICIT_ATT features, as described in Sect. 2.1, and the LABEL
feature using a data-type-specific equivalence threshold [12]. A rule fires
when all tests within the rule have scores higher than zero. From these scores
we also derive a confidence score for each rule firing, which equals the
product of all scores used within the rule.
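
Rule firing and its confidence can be sketched as follows, using the GF-Player
rules (1) to (4) as data. The representation of a rule as a list of
(attribute, kind) tests, the score dictionary keyed by those tests, and the
treatment of a missing score as zero are assumptions for illustration.

```python
from math import prod

# Each rule: (list of (attribute, "Equal" | "Unequal") tests, outcome).
GF_PLAYER_RULES = [
    ([("draftYear", "Equal"), ("draftPick", "Equal")], "Match"),
    ([("LABEL", "Equal"), ("birthDate", "Equal")], "Match"),
    ([("draftYear", "Unequal")], "Non-Match"),
    ([("draftPick", "Unequal")], "Non-Match"),
]


def fire(rule, scores):
    """Return (outcome, confidence) if the rule fires, otherwise None.

    A rule fires when every test has a score higher than zero; the
    confidence is the product of those scores. A test without a score is
    treated as zero (assumption).
    """
    tests, outcome = rule
    values = [scores.get(test, 0.0) for test in tests]
    if all(value > 0 for value in values):
        return outcome, prod(values)
    return None


# Hypothetical equal/unequal scores for one compared pair.
scores = {
    ("draftYear", "Equal"): 0.9,
    ("draftPick", "Equal"): 0.8,
    ("LABEL", "Equal"): 1.0,
}
fired = [result for rule in GF_PLAYER_RULES if (result := fire(rule, scores))]
print(fired)
```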

As the rules fire only when certain conditions are met, the set of rules is not 
exhaustive and only covers a subset of compared pairs. We therefore ensemble the 
rules with the unsupervised matching rule through averaging. Given a compared 
pair, we first check how many rules fire. If no rule fires, we simply return the 
output of the unsupervised matching rule. If multiple rules fire, which is possible 
as the rules are not mutually exclusive, we consider only the rule with the highest 
confidence, preferring negative rules in case of a tie. If the confidence of this rule 
is higher than the confidence of the output of the unsupervised matching rule, 
the outputs of both are averaged and returned. Otherwise, we simply return the 
output of the unsupervised matching rule. 
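
The ensembling procedure can be sketched as follows. How exactly the two
outputs are averaged is not spelled out above; the sketch assumes that each
output is encoded as a signed value (positive for a match, negative for a
non-match, magnitude equal to the confidence) and that the signed values are
averaged. The function names are hypothetical.

```python
def signed(output):
    """Encode (outcome, confidence) as a signed value (assumed encoding)."""
    outcome, confidence = output
    return confidence if outcome == "Match" else -confidence


def ensemble(fired_rules, unsupervised):
    """Combine fired class-specific rules with the unsupervised rule output.

    fired_rules is a list of (outcome, confidence) tuples, one per firing
    rule; unsupervised is the (outcome, confidence) of the unsupervised rule.
    """
    if not fired_rules:
        return unsupervised
    # Highest confidence wins; on a tie, the negative rule is preferred.
    best = max(fired_rules, key=lambda r: (r[1], r[0] == "Non-Match"))
    if best[1] <= unsupervised[1]:
        return unsupervised
    average = (signed(best) + signed(unsupervised)) / 2
    return ("Match" if average > 0 else "Non-Match", abs(average))


print(ensemble([], ("Match", 0.2)))
print(ensemble([("Non-Match", 0.8)], ("Match", 0.2)))
```

Because the rule is only used when its confidence is strictly higher than
that of the unsupervised rule, the averaged value can never be exactly zero
under this encoding.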


3.3 Bootstrapping Approach 


In our experiments, we, on the one hand, directly apply the unsupervised rule 
and the weakly supervised ensembled classifier to our test data. On the other 
hand, following the data-programming paradigm, we employ both methods as 
labeling functions to label row pairs and entity-instance-pairs derived from 1000 
randomly selected web tables as matches or non-matches. Additionally, the label- 
ing functions assign weights to the training examples using the confidence scores 
returned by the underlying method. Using these labels we train a random forest 
classifier, which is then applied to our test data. 

To derive pairs to be labeled, we employ label-based blocking using Lucene 
for both row clustering and new detection. We additionally include random pairs 
to be labeled, for row clustering as many as there are positive pairs, and for new 
detection 8 random instances selected from the knowledge base from within the 
same class of an entity or its parent classes. Overall, this leads to 2.8
million row pairs and 1.27 million entity-instance-pairs selected to be labeled.
For row clustering, we use the confidence scores to additionally perform
correlation clustering. A row pair that is labeled as a match but whose rows
are not part of the same cluster is not included as a positive training
example. Similarly, a row pair that is labeled as a non-match but whose rows
are placed in the same cluster is not considered as a negative training
example.

For new detection, when multiple entity-instance-pairs of the same entity
are labeled as matching, which cannot be correct, we only include the
entity-instance-pair with the highest score as a positive training example.
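
The two consistency filters applied to the generated training data can be
sketched as follows. The tuple layouts and function names are assumptions; the
sketch also assumes that surplus match-labeled entity-instance-pairs are
dropped rather than relabeled.

```python
def filter_row_pairs(labeled_pairs, cluster_of):
    """Keep row pairs whose label agrees with the correlation clustering.

    labeled_pairs: list of (row_a, row_b, is_match).
    cluster_of: dict mapping a row id to its cluster id.
    """
    kept = []
    for row_a, row_b, is_match in labeled_pairs:
        same_cluster = cluster_of[row_a] == cluster_of[row_b]
        if is_match == same_cluster:
            kept.append((row_a, row_b, is_match))
    return kept


def filter_entity_pairs(labeled_pairs):
    """Keep at most one positive pair per entity, the highest-scoring one.

    labeled_pairs: list of (entity, instance, is_match, score).
    Non-matching pairs are kept unchanged.
    """
    best = {}
    for pair in labeled_pairs:
        entity, _, is_match, score = pair
        if is_match and (entity not in best or score > best[entity][3]):
            best[entity] = pair
    negatives = [pair for pair in labeled_pairs if not pair[2]]
    return list(best.values()) + negatives


clusters = {"r1": 0, "r2": 0, "r3": 1, "r4": 0}
row_pairs = [("r1", "r2", True), ("r1", "r3", True),
             ("r2", "r3", False), ("r1", "r4", False)]
print(filter_row_pairs(row_pairs, clusters))
```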


Table 2. Row clustering performance for runs with various types of supervision. 


Method             | Average            | GF-Player | Song | Settlement
                   | PCP  | AR   | F1   | F1        | F1   | F1
Unsupervised       | 0.76 | 0.86 | 0.80 | 0.90      | 0.65 | 0.86
+ Bootstrapping    | 0.78 | 0.88 | 0.83 | 0.89      | 0.73 | 0.86
Weak supervision   | 0.83 | 0.89 | 0.86 | 0.93      | 0.81 | 0.84
+ Bootstrapping    | 0.83 | 0.90 | 0.86 | 0.89      | 0.83 | 0.86
Strong supervision | 0.86 | 0.90 | 0.88 | 0.91      | 0.84 | 0.90
+ Bootstrapping    | 0.85 | 0.90 | 0.87 | 0.92      | 0.79 | 0.91


When bootstrapping for new detection, we also need a set of row clusters 
from which we create entities. Using these entities we can then generate training 
examples using entity-instance-pairs and our labeling function. To create these 
clusters, we use the supervised model trained by bootstrapping from a label- 
ing function of equal supervision, i.e. when we are bootstrapping a supervised 
learning algorithm for new detection using the unsupervised rule, we use the 
clustering method also trained using bootstrapping and the unsupervised rule. 

Given the labeled pairs, we train a random forest classifier. Per forest, we 
train 2000 trees. To reduce correlation between trees, we set the features available 
at each split to 2, and reduce the sample size used to train each tree to 66% of 
the total number of pairs. We sample with replacement and using weights, so 
that higher weighted examples are considered more often during training. 
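
The per-tree sampling step can be sketched with the standard library alone.
The constants are taken from the text; the function name and the use of
`random.choices` as the sampling mechanism are our assumptions, and the
actual tree learner is not shown.

```python
import random

N_TREES = 2000           # trees per forest, as stated in the text
FEATURES_PER_SPLIT = 2   # features available at each split
SAMPLE_FRACTION = 0.66   # share of pairs used to train each tree


def sample_for_tree(pairs, weights, fraction=SAMPLE_FRACTION, rng=None):
    """Draw the training sample for one tree.

    Samples with replacement, proportionally to the confidence-based
    weights, so that higher weighted examples are drawn more often.
    """
    rng = rng or random.Random()
    size = max(1, round(len(pairs) * fraction))
    return rng.choices(pairs, weights=weights, k=size)


pairs = list(range(100))
weights = [1.0] * 99 + [50.0]   # the last pair carries a much higher weight
sample = sample_for_tree(pairs, weights, rng=random.Random(0))
print(len(sample), sample.count(99))
```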


4 Evaluation and Results 


In this section, we evaluate the approaches described above using the T4LTE
gold standard and compare them to a model trained with manually labeled data.
As the gold standard is also used for training the latter, we apply three-fold
cross-validation throughout all experiments. Additionally, we evaluate the
effectiveness of the user-provided rule sets and our bootstrapping approach.
4.1 Row Clustering Evaluation


To evaluate row clustering, we employ the evaluation metric proposed by Has- 
sanzadeh et al. [7,12]. It emphasizes replicating the exact number of clusters in 
the evaluation set by first computing a one-to-one mapping between returned 
clusters and clusters in the evaluation set. Only rows of clusters with a mapping 
contribute towards recall, while the pairwise clustering precision is penalized 
by the difference between the number of clusters in the evaluation set and the 
number of returned clusters, or the clusters with a mapping, whichever is higher. 

Table 2 shows row clustering performance for different types of supervision. 
The first two rows show performances when using the unsupervised matching 
rule alone, while the following two rows show the performances when using the 
weakly supervised ensembled classifier. The final two rows show the performances 
when using strong supervision. For each supervision type we apply and evaluate 
the underlying method directly on the test set, and then use it as a labeling 
function to bootstrap a random forest, which we then also apply and evaluate 
on the test set. For strong supervision, the bootstrapped method resembles a 
semi-supervised learning approach. 

From the table, we can see that the difference in average F1 between a model 
trained using strong supervision, which has an F1 of 0.88, and the unsupervised 
rule without bootstrapping is 8 pp. We find that using bootstrapping with the 
unsupervised matching rule allows us to increase F1 by 3 pp on average, with an 
increase of 8 pp for the class Song. Using user-provided class-specific rule sets, 
we achieve an average F1 score of 0.86, which is a large increase of overall 6 pp 
from the unsupervised rule and very close to the performance when using strong 
supervision. Applying bootstrapping on the weakly supervised method does not 
increase average F1 further, mainly because we lose performance for the class 
GF-Player, while gaining performance in the other two classes. This is similarly 
the case when bootstrapping from a model trained using strong supervision, 
except we also lose one percentage point in average F1.

When bootstrapping, the labeling functions were given overall 2.8 million row
pairs to label, which were selected either by the label-based blocker or chosen
randomly. Given as labeling function the weakly supervised ensembled classifier,
275 thousand pairs were labeled as matches, while 2.54 million pairs were
labeled as non-matches. For this output, the user-provided matching rules fire
in total 37 thousand times, whereas the non-matching rules fire 500 thousand
times.


4.2 New Detection Evaluation 


We evaluate a new detection method using both the existing and the new
entities labeled in the gold standard. Precision equals the proportion of
entities returned as new by the method that are actually new, while recall
equals the proportion of new entities in the testing set that were returned as
new by the method.
Table 3 shows new detection performance for runs with various types of super- 
vision, similar to Table 2. We first find that a model trained using the provided 
strong supervision outperforms the unsupervised matching rule in F1 by 7 pp on 
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Table 3. New detection performance for runs with various types of supervision. 


Method Average GF-Player | Song | Settlement 
P R |Fl FI Fl |F1 

Unsupervised 0.87 | 0.76 | 0.80 0.82 0.68 | 0.89 

+ Bootstrapping | 0.86 | 0.86 | 0.85 | 0.86 0.78 | 0.90 

Weak supervision | 0.87 | 0.81 | 0.83 | 0.82 0.78 | 0.89 

+ Bootstrapping | 0.87 | 0.90 | 0.87 0.87 0.85 | 0.90 

Strong supervision | 0.82 | 0.94 | 0.87 0.88 0.92 | 0.81 

+ Bootstrapping | 0.81 | 0.97 | 0.88 | 0.88 0.92 | 0.83 
average, and by 24 points for the class Song. On the other hand, the unsuper- 
vised matching rule outperforms the model trained using strong supervision by 


8 pp for the class Settlement, indicating that the trained model highly overfits. 
By employing the user-provided rule sets as weak supervision, we are able to 
increase average F1 by 3 pp. 
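
The precision and recall definitions used for new detection can be written down in a few lines. The following is a minimal sketch, assuming entities are identified by ids and that the method output and the gold standard are available as sets; the function name and the toy ids are illustrative and not taken from the paper.

```python
def new_detection_scores(predicted_new, gold_new):
    """Precision, recall and F1 for the 'new' class.

    predicted_new: set of entity ids the method returned as new
    gold_new:      set of entity ids labeled as new in the gold standard
    """
    true_positives = len(predicted_new & gold_new)
    precision = true_positives / len(predicted_new) if predicted_new else 0.0
    recall = true_positives / len(gold_new) if gold_new else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up entity ids (not data from the paper).
predicted = {"e1", "e2", "e3", "e4"}
gold = {"e2", "e3", "e4", "e5", "e6"}
p, r, f1 = new_detection_scores(predicted, gold)
```

Here three of the four returned entities are truly new, and three of the five gold entities are found, so precision is higher than recall in this toy case.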

Unlike for row clustering, bootstrapping is consistently effective for new 
detection. It increases average F1 in the unsupervised case by 5, and in the 
weakly supervised case by 3 pp. The latter allows us to achieve an equal average 
F1 to that of strong supervision, albeit a large part is due to the Settlement 
class, while for Song we are still lacking 7 points in F1. Bootstrapping is also 
effective when used with a model trained using strong supervision. 

When bootstrapping, a total of 1.27 million entity-instance pairs are given to 
the labeling functions to be labeled. When using the ensembled classifier, we 
find that 26 thousand pairs were labeled as matches, and the remainder as 
non-matches. Within the ensembled classifier, the user-provided matching rules 
fire 13 thousand times, whereas the non-matching rules fire 150 thousand times. 


4.3 End-To-End Evaluation 


We will now evaluate a full run of the pipeline using weak supervision. As this 
runs row clustering and new detection sequentially, the errors of the methods 
tend to accumulate and reduce overall end-to-end performance [12]. 

To evaluate how well new entities were found, we utilize precision and recall. 
To compute precision, we determine the proportion of entities returned as new 
that are correct. An entity is correctly new only if its cluster includes the 
majority of the rows of a new cluster in the gold standard, and these rows at 
the same time form the majority within the entity’s cluster. Recall is the 
fraction of new entities in the gold standard for which a correct new entity 
was returned. 
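
The two-sided majority criterion above can be sketched as follows. This is only an illustration under stated assumptions: clusters are represented as sets of row ids, "majority" is read as strictly more than half, and the function names and toy clusters are hypothetical.

```python
def matches(entity_rows, gold_rows):
    """True if the shared rows are a majority of both clusters."""
    shared = len(entity_rows & gold_rows)
    return shared > len(gold_rows) / 2 and shared > len(entity_rows) / 2


def end_to_end_scores(returned_new, gold_new):
    """returned_new, gold_new: lists of sets of row ids, one set per cluster."""
    # Returned entities that correspond to some new gold cluster.
    correct = sum(any(matches(e, g) for g in gold_new) for e in returned_new)
    # New gold clusters for which a correct new entity was returned.
    found = sum(any(matches(e, g) for e in returned_new) for g in gold_new)
    precision = correct / len(returned_new) if returned_new else 0.0
    recall = found / len(gold_new) if gold_new else 0.0
    return precision, recall


# Toy clusters (made-up row ids).
gold_new = [{1, 2, 3}, {4, 5}]
returned_new = [{1, 2, 9}, {4, 7, 8}, {10}]
precision, recall = end_to_end_scores(returned_new, gold_new)
```

In the toy data only the first returned entity passes both majority tests; the second shares just one of two gold rows and is dominated by foreign rows, so it counts against precision.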

Table 4 shows end-to-end performance for different types of supervision, 
similar to Table 2. From the table we can see that the highest performance is 
achieved by the model trained using strong supervision. It achieves an average 
F1 of 0.81. The highest performance achieved by the methods without strong 
supervision is 0.78 for the weak supervision method with bootstrapping. The 
lowest performance of 0.69 is achieved by the unsupervised method without 
bootstrapping. Overall, we find that we are able to achieve a performance quite 
close to that when using strong supervision, and much better than a simple 
unsupervised matching rule. As a result, we can successfully perform long-tail 
entity extraction with significantly reduced labeling effort. While on average 
we lose recall with almost no loss in precision, the actual effect differs per 
individual class. 

Table 4. End-to-end evaluation for various types of supervision. 

                    |      Average       | GF-Player | Song | Settlement 
Method              |  P   |  R   |  F1  |    F1     |  F1  |     F1 
Unsupervised        | 0.71 | 0.71 | 0.69 |   0.76    | 0.50 |    0.82 
  + Bootstrapping   | 0.71 | 0.81 | 0.74 |   0.79    | 0.60 |    0.82 
Weak supervision    | 0.72 | 0.77 | 0.74 |   0.76    | 0.63 |    0.82 
  + Bootstrapping   | 0.72 | 0.86 | 0.78 |   0.81    | 0.72 |    0.80 
Strong supervision  | 0.73 | 0.93 | 0.81 |   0.84    | 0.78 |    0.81 
  + Bootstrapping   | 0.68 | 0.93 | 0.78 |   0.84    | 0.69 |    0.80 

The user-provided rule sets have a strong positive impact on performance, 
increasing F1 by 5 pp. Bootstrapping also increases average F1 by 5 and 4 pp for 
the unsupervised and weakly supervised runs respectively. Overall, we achieve an 
increase of 9 points when comparing a weakly supervised bootstrapped method 
with an unsupervised non-bootstrapped method. The effect is especially large 
for Song, where we gain 22 pp in F1. 

Bootstrapping from a strongly supervised method is not effective and reduces 
overall performance. This is because bootstrapping had mixed results for row 
clustering with both weak and strong supervision. This is especially the case 
for the class Song, where a method bootstrapped from strong supervision 
produces 29 bad clusters, leading to a significant drop in end-to-end 
performance. 

Finally, we notice that precision is consistently lower than recall. For 
GF-Player and Settlement, for example, we have precisions of 0.68 and 0.70, 
with recalls of 1.00 and 0.92 respectively. This problem is caused by bad 
clustering, primarily for existing entities, which are then classified as new 
by the new detection component, thereby reducing precision without affecting 
recall. When summing numbers for all testing folds, we are missing 8 existing 
clusters for football players, meaning that their rows were incorrectly 
included in other existing clusters, causing them to be impure. In the case of 
settlements, we have overall generated 16 extra existing clusters. For 
GF-Player and Settlement, this leads to 8 and 9 clusters respectively being 
incorrectly determined to be new. This shows that errors in the pipeline 
accumulate and that there is a need for an additional component in the pipeline 
that detects and filters out bad clusters. This pattern does not appear for the 
class Song, but only because Song suffers from bad clustering for both new and 
existing clusters, leading to lower recall and precision. As a result, even the 
class Song would benefit from a bad cluster filtering component. 


5 Discussion 


Ensembling the user-provided rule sets with an unsupervised matching rule 
yields a quite effective method that requires minimal supervision. The 
unsupervised rule, while class-agnostic and simple, still provides an acceptable baseline 
performance, and more importantly, full coverage to our method. This allows us 
to require that the rules only be accurate, but not exhaustive, even when the 
number of provided rules is small. Additionally, these rules are not only easily 
created by an expert, but could also be mined from or tested on the knowledge 
base, further reducing supervision effort. A big limitation of our approach is that 
the rule sets require web tables to describe entities using useful knowledge base 
properties. This is not the case for settlements, where we find that the number 
and density of attributes in the web tables are limited [12]. 
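
To illustrate why an unsupervised fallback gives full coverage while the user-provided rules only need to be accurate, consider the following sketch of one way such an ensemble could behave. It is not the ensembling procedure of this paper: the precedence order (non-matching rules, then matching rules, then fallback), the threshold, and all rule and attribute names are assumptions made for illustration.

```python
def ensemble_label(pair, match_rules, non_match_rules, fallback_score, threshold=0.5):
    """Label a candidate pair; returns (is_match, source_of_decision).

    User-provided rules decide whenever one of them fires (assumed precedence:
    non-matching rules first); otherwise the unsupervised score decides,
    so every pair receives a label.
    """
    if any(rule(pair) for rule in non_match_rules):
        return False, "non-matching rule"
    if any(rule(pair) for rule in match_rules):
        return True, "matching rule"
    return fallback_score(pair) >= threshold, "unsupervised rule"


# Hypothetical rules over a pair described by precomputed comparison features.
def same_birthdate(pair):
    return pair.get("birthdate_equal") is True


def different_birthdate(pair):
    return pair.get("birthdate_equal") is False


def label_similarity(pair):
    return pair["label_sim"]


rules = dict(match_rules=[same_birthdate],
             non_match_rules=[different_birthdate],
             fallback_score=label_similarity)

a = ensemble_label({"birthdate_equal": True, "label_sim": 0.2}, **rules)
b = ensemble_label({"birthdate_equal": False, "label_sim": 0.9}, **rules)
c = ensemble_label({"label_sim": 0.7}, **rules)  # no rule fires
```

The third pair has no birth date in the web table, so no rule fires and the decision falls to the unsupervised score, which is the coverage property discussed above.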

While bootstrapping produces mixed results for row clustering, its impact on 
new detection and end-to-end performance is positive. There are several factors 
that possibly contribute to this positive effect. First, a random forest is more 
expressive than either the unsupervised matching rule or the user-provided rule 
sets. It also exploits a larger feature set than both, especially making use of 
the class-specific scores returned by the ATTRIBUTE and IMPLICIT_ATT features. 
By weighting training pairs, we ensure that pairs with a higher confidence are 
given a higher importance, while less certain pairs are still considered. As boot- 
strapping works within the context of a component, i.e. row clustering or new 
detection, it can make use of component-specific characteristics. For example, 
given one created entity, only one knowledge base instance can possibly be a cor- 
rect match. This allows us to eliminate likely incorrect training examples during 
bootstrapping for new detection by keeping for one entity only the matching 
entity-instance-pair with the highest confidence. 
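
The filtering and weighting steps described for new detection can be sketched as below. This is a minimal illustration, assuming each automatically labeled pair carries a confidence in [0, 1] that is reused as its training weight; the tuple layout and the function name are hypothetical.

```python
def select_training_pairs(labeled_pairs):
    """Filter automatically labeled entity-instance pairs for bootstrapping.

    labeled_pairs: iterable of (entity_id, instance_id, is_match, confidence).
    Keeps every non-match, but for each entity only the match with the
    highest confidence, since at most one instance can be correct.
    The confidence of each kept pair serves as its training weight.
    """
    non_matches = []
    best_match = {}
    for entity, instance, is_match, conf in labeled_pairs:
        if not is_match:
            non_matches.append((entity, instance, False, conf))
        elif entity not in best_match or conf > best_match[entity][3]:
            best_match[entity] = (entity, instance, True, conf)
    return non_matches + list(best_match.values())


# Made-up pairs for illustration.
pairs = [
    ("e1", "kb1", True, 0.9),
    ("e1", "kb2", True, 0.6),   # dropped: e1 already has a stronger match
    ("e1", "kb3", False, 0.8),
    ("e2", "kb4", True, 0.7),
]
training = select_training_pairs(pairs)
```

The fourth tuple element could then be supplied as a per-example weight when fitting the classifier, for example through the `sample_weight` argument that scikit-learn's random forest `fit` method accepts.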


6 Related Work 


Various methods exist to reduce effort spent on manual labeling. Semi-supervised 
methods use a small set of labeled and a larger set of unlabeled examples to train 
a model. This includes for example co-training and self-training, which train 
models on data that they labeled themselves, using initially a small number of 
seed examples. Another approach to reducing labeling effort is active learning, 
where a user is queried to label examples that are chosen to provide the most 
information when labeled [6]. 

Weak supervision approaches exploit supervision at a higher abstraction 
or that is noisier in nature to efficiently generate a large number of training 
examples, even if those are of a lower quality [14,15]. This includes letting non- 
experts generate labels through crowdsourcing or employing rules and heuristics 
for labeling data. Multiple weak supervision approaches can be combined to 
overcome the possibly lower accuracy and coverage of weak supervision [14]. 

One method of weak supervision is distant supervision [11], where a knowledge 
base or any other external resource is used to train a supervised algorithm. 
While originally applied in the context of relation extraction from text, it has 
been used for the task of augmenting a knowledge base from semi-structured 
web data, including web tables [4,10]. Bizer et al. [2] make use of schema.org 
annotations extracted from 43 thousand e-shops to distantly supervise a deep 
neural network for product matching. To generate training pairs, they make use 
of generic product identifiers that are often provided along with the annotations. 

Ratner et al. [15] introduce the data programming paradigm, where any weak 
supervision strategy, including domain heuristics and distant supervision, can 
be codified into individual low-coverage labeling functions. The authors focus 
on denoising noisy and conflicting labels, by assigning accuracies to labeling 
functions using a generative algorithm. In contrast, we do not label using the 
individual rules, but first ensemble a set of rules and an unsupervised weighted 
average rule to create one labeling function per class. While we attempt to 
overcome the low coverage of our rules using ensembling, the authors do not 
suggest an approach to overcome the possible low coverage of their labeling 
functions. Snorkel is a system that enables the use of weak supervision based on 
the data programming paradigm [14]. Snorkel DryBell adapts Snorkel to exploit 
diverse organizational knowledge resources. Its effectiveness is evaluated in a 
large-scale case-study at Google [1]. 

Snuba [19] is a weak supervision system that uses a small set of labeled 
data to derive heuristics to generate training data and train a machine learning 
model. The heuristics are similar in purpose to our rule sets, and the authors 
also limit themselves to what they term primitive features, which in their case 
are bag-of-words representation for text or bounding box attributes for images. 
In our case, we limit ourselves to attribute tests using the schema of a knowledge 
base. As in our case, training a machine learning model yields an increase in 
performance, which the authors similarly attribute to the fact that learned 
models are more expressive and can exploit more features. Snuba still requires 
hundreds of manually labeled training examples to derive heuristics, whereas in 
our case experts only need to provide a small number of bold matching rules. 

Shen et al. [18] introduce constraint-based entity matching, where they sug- 
gest a probabilistic framework within which domain-specific constraints can be 
exploited to perform entity matching without the need for manual labeling. The 
introduced constraints are of a broad variety and not limited to a specific format. 
Their work differs from ours, as, first of all, their constraints are generally more 
complex and not based on simple attribute tests using a predefined schema. This 
makes providing supervision less straightforward and possibly more laborious 
for experts. Additionally, they only provide a matching method that uses the 
constraints directly, and do not consider using them to bootstrap a supervised 
machine learning algorithm. 

To bootstrap supervised learning, a small number of labeled seed examples 
are often used [11,13], but there have also been approaches that use alternatives 
to seeds, e.g. domain-independent patterns [5]. We bootstrap by using a classifier 
that ensembles a heuristic domain-agnostic matching rule and a limited set of 
user-provided class-specific matching rule sets. 


7 Conclusion 


This work investigates the possibility of reducing the effort spent on manually 
labeling training data for the task of augmenting knowledge bases with long-tail 
entities from web tables. For this, we introduce and evaluate a weak supervision 
approach that exploits more efficient supervision at a higher level of abstraction. 

Specifically, we suggest, as an alternative to manually labeling thousands of 
entity matching pairs, the use of a small set of bold user-provided class-specific 
matching rules. These rules are built upon properties from the schema of a 
knowledge base class, making them universal and semantically easy to under- 
stand. More importantly, these rules require considerably less effort to create. 
To overcome the possibly limited coverage of these rules, we suggest a method to 
ensemble these class-specific matching rules with a class-agnostic unsupervised 
matching model. This yields an effective weakly supervised method for long-tail 
entity extraction. 

We then introduce an approach to bootstrap a supervised learning algorithm 
by using the weakly supervised method as a labeling function and a set of unla- 
beled web tables. We find that with bootstrapping, we are able to achieve a per- 
formance close to that of supervision with manually labeled data. As a result, we 
are able to perform long-tail entity extraction with considerably reduced effort 
spent on supervision. 

Our weak supervision approach can be highly useful for a variety of tasks. In 
cases where recall is a secondary objective, our approach can be tuned towards 
precision and used to add highly accurate, albeit fewer, long-tail entities to a 
knowledge base. The approach can also be used to facilitate generating train- 
ing data for manual labeling, where experts must only correct generated labels 
instead of creating them. This would considerably reduce the effort required for 
manually labeling training data. 

We believe that an interesting direction for future work would be combining 
weakly supervised labeling functions and active learning. The labeling functions 
could be used to reduce the effort spent on learning initial models. These models 
can afterwards be refined by labeling individual examples chosen by the active 
learning method. 


Abstract. Path querying on Semantic Networks is gaining increased 
focus because of its broad applicability. Some graph databases offer 
support for variants of path queries, e.g. shortest path. However, many 
applications need the set version of various path problems, i.e. 
finding paths between multiple source and multiple destination nodes 
(subject to different kinds of constraints). Further, the sets of source 
and destination nodes may be described declaratively as patterns, rather 
than given explicitly. Such queries lead to the requirement of integrating 
graph pattern matching with path problem solving. Current support for 
such queries is limited (inability to express some classes, incomplete 
results, inability to complete query evaluation unless graph patterns 
are extremely selective, etc.). 

In this paper, we propose a framework for evaluating generalized path 
queries (gpqs) that integrates an algebraic technique for solving path 
problems with SPARQL graph pattern matching. The integrated algebraic 
querying technique enables more scalable and efficient processing 
of gpqs, including the possibility of support for a broader range of path 
constraints. We present the approach and implementation strategy and 
compare performance and query expressiveness with a popular graph 
engine. 


Keywords: Algebraic interpretation · Path query · Graph pattern 
matching 


1 Introduction 


Many applications have to find connections between entities in datasets. In 
graph-theoretic terms, this amounts to querying for paths in graphs between 
multiple sources and destinations. Often the sets of sources and destinations 
cannot easily be given explicitly, but only in terms of patterns to be matched 
in graphs. For example, to assess security risks for flights, security 
officials may want to know about relationships between p1 = passengers on any 
flights to a particular destination within a particular time window who 
purchased one-way tickets by cash, and p2 = countries on the CIA watchlist. 
Here p1 and p2 are patterns describing the sets of sources and destinations of 
interest. Such inquiries also commonly occur when dealing with biological 
networks as well as in several non-traditional emerging applications, e.g. 
networking. For the latter example, suppose there is a network composed of SDN 
ASs (Autonomous Systems), where an AS controller may want to compute a 
domain-level path from one node to another for an application where the query 
includes constraints related to business relationships with potential transit 
domains. Another feature of path queries, as demonstrated by the networking 
example, is that there can be constraints on paths, e.g. avoiding domains of 
type T, constraints on path disjointness (link- or node-disjoint for a specific 
resilience level), or other structural constraints. Such constraints are more 
expressive than property path queries, which require a regular expression over 
the properties in the path being searched for. In a sense, these queries are 
traditional path queries generalized to include graph patterns and path 
constraints. We refer to such queries as Generalized Path Queries (gpqs). 

Property path expressions in SPARQL are also motivated by the need for 
graph traversal queries. However, they are fundamentally different from path 
queries in that the result of a property path expression is not paths but rather 
sets of endpoint nodes connected by paths that match the property path 
pattern. G-Core [15] presents a good discussion of different classes of graph 
queries. Existing graph-based query engines such as Neo4j [9], Stardog [13], 
AllegroGraph [14], AnzoGraph [6], and Virtuoso [18] provide varying degrees of 
support for path querying. Some other platforms such as [19,21-24,30] have 
focused exclusively on path querying. 

A common thread across existing path querying evaluation strategies is that 
they are built on traditional graph algorithms. The challenge with graph theo- 
retic interpretations of such queries is that the different constraints in gpqs may 
translate to different classes of graph problems, requiring different algorithms. 
For example, shortest path algorithms vs. subgraph isomorphism algorithms vs. 
subgraph homeomorphism, etc. From the point of view of query processing, this 
is a limited approach because of the limited opportunity for decomposition and 
reusability. On the other hand, adopting an algebraic perspective allows prob- 
lems to be interpreted in a more generalized form. This also allows for more nat- 
ural integration with algebraic graph pattern query engines. Considering such 
a strategy makes sense once one observes that gpqs are essentially comprised 
of four elements: graph pattern matching, joining/filtering of graph patterns, 
path computation, path filtering. Some existing platforms like [13] do partially 
interpret gpq-like queries algebraically. However, the absence of a complete 
algebraic query interpretation framework results in falling back on traditional 
graph algorithms in many situations. 

In this paper, we propose an algebraic query evaluation technique for gpqs 
that delineates the four gpqs subquery elements and their mapping to algebraic 
query operations so that gpqs query planning translates to composition and 
ordering of query operations. More specifically, the paper presents: 

— a conceptual query evaluation model that integrates algebraic graph pattern 
matching with algebraic path problem solving. 

— an implementation model that perturbs the plan for a graph pattern matching 
query generated by a SPARQL query compiler by splicing in algebraic path 
querying operators to produce a gpqs query plan. An advantage of this 
strategy is that current SPARQL parsers and existing graph pattern matching 
compilers can be adopted without modification. An example implementation 
strategy using Apache Jena’s query compiler and Apache Tez’s DAG for physical 
execution is presented. 

— a comparison of the performance and expressiveness of the integrated platform 
with a popular engine. 


Section 2 presents the background on algebraic path problem solving and 
graph pattern matching. Related work, including existing graph querying engines 
and their limitations, is discussed in Sect. 3. Section 4 discusses our approach, 
both conceptually and in terms of the implementation model, with the evaluation 
presented in Sect. 5. Section 6 concludes. 


2 Background 


2.1 Algebraic Path Problem Solving in Directed Graphs 


We begin with a brief review of an efficient algebraic path problem solving 
approach due to [34]. An edge $e$ in a directed labeled graph $G = (V, E)$ is 
denoted as $e = (v_1, v_2)$ with label $\lambda(e) = l_e$, where 
$v_1, v_2 \in V$ and $e \in E$. A path $p$ in this graph $G = (V, E)$ is 
defined as an alternating sequence of nodes and edge labels terminating in a 
node, $p = \{v_1, l_{e_1}, v_2, l_{e_2}, \ldots, v_n, l_{e_n}, v_{n+1}\}$, where 
$v_1, v_2, \ldots, v_n, v_{n+1} \in V$ and $e_1, e_2, \ldots, e_n \in E$. A path 
expression of type $(s, d)$, $PE(s, d)$ [34], is a 3-tuple $(s, d, R)$, where 
$R$ is a regular expression over the set of edges defined using the standard 
operators union ($\cup$), concatenation ($\bullet$) and closure ($*$) such that 
the language $L(R)$ of $R$ represents paths from $s$ to $d$, where 
$s, d \in V$. For example, in Fig. 1(a), borrowed from [21], 
$PE(2, 7) = (2, 7, ((b \bullet c \bullet f) \cup (i \bullet f)))$ is a path 
expression of type $(2, 7)$ (for brevity, only edges are captured in the 
regular expression, no nodes). Path expressions may or may not be complete in 
terms of the subset of paths represented. For example, $PE(2, 7)$ only 
represents two of the several paths between 2 and 7. 

Fig. 1. Example explaining path-sequence and expression: (a) example graph, 
(b) path-sequence for the graph, (c) partial path expression for paths between 
nodes 2 and 7, $PE(2, 7) = (2, 7, ((b \bullet c \bullet f) \cup (i \bullet f)))$. 

If a graph is ordered using any numbering scheme, path information can 
be represented using a particular ordering of path expressions called a 
Path-Sequence (PS) [21,34]. Figure 1(b) shows the path-sequence that represents 
the example graph in Fig. 1(a). It can be observed that some path expressions 
are simple, e.g. representing only a single edge, while others are more 
complex. The formalization of a path sequence [34] defines which path 
expressions are in a path sequence. A particularly appealing property of a 
path-sequence is that many path problems can be solved using a simple 
propagation SOLVE algorithm [21,34] that assembles path information as it scans 
the path-sequence from left to right. At every iteration of the SOLVE algorithm 
the following step is performed: 
$PE(s, w_i) \cup (PE(s, v_i) \bullet PE(v_i, w_i)) \rightarrow SA[w_i]$, 
where an existing path expression for $(s, w_i)$ is extended using 
concatenation of two subpath expressions and/or union with a new path 
expression capturing additional paths for $(s, w_i)$. At the end of the scan 
and propagation phase, we are guaranteed completeness for the source node used 
to drive the propagation phase. The original single-source SOLVE algorithm was 
generalized in [21] to multiple sources, with a particular emphasis on sharing 
computation across sources where subexpressions were common. 
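
The propagation step can be illustrated with a deliberately simplified sketch. It is not the SOLVE algorithm of [34]: it assumes an acyclic graph (so the closure operator is never needed), assumes the path-sequence consists of single edges in an order where all edges entering a node precede the edges leaving it, and represents a path expression extensionally as a set of label tuples. The graph and all names are hypothetical.

```python
from collections import defaultdict


def solve_acyclic(path_sequence, source):
    """Propagate path information from `source` over an acyclic graph.

    path_sequence: list of (v, w, label) edges, ordered so that every edge
    entering v is scanned before any edge leaving v.
    Returns SA, mapping each node w to the set of paths from source to w,
    each path being a tuple of edge labels.
    """
    sa = defaultdict(set)
    sa[source].add(())  # the empty path from the source to itself
    for v, w, label in path_sequence:
        # PE(s, w) U (PE(s, v) . PE(v, w)) -> SA[w]
        sa[w] |= {path + (label,) for path in sa[v]}
    return sa


# Hypothetical five-node graph, not the one from Fig. 1(a).
sequence = [(1, 2, "a"), (1, 3, "b"), (2, 4, "c"), (3, 4, "d"), (4, 5, "e")]
sa = solve_acyclic(sequence, source=1)
paths_to_5 = sa[5]
```

A single left-to-right scan suffices here because each `SA[v]` is already complete when the edges leaving `v` are processed; handling cycles is what requires the more general path expressions with closure.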

One of the main issues with graph computation is that every problem requires 
a different algorithm. A nice property of this algebraic framework, as shown by 
[34], is that multiple path problems can be solved using the same algorithm by 
interpreting the union ($\cup$) and concatenation ($\bullet$) operators 
appropriately. For example, the shortest path problem has a very 
straightforward interpretation in terms of the union ($\cup$) operator: rather 
than taking the union of multiple path expressions, all but the one with the 
least cost are ignored. Some problems can also be interpreted in terms of 
manipulation of the path expression produced by the unconstrained path problem; 
[17,21,23] all describe some examples. For problems in this category, a 
critical issue is that computationally efficient representations of path 
expressions are used rather than mere string representations. For example, 
there is a natural mapping from regular expressions to abstract syntax trees 
(ASTs), where operators like union ($\cup$) and concatenation ($\bullet$) form 
the internal nodes while the edges form the leaves of the tree. Figure 1(c) 
shows the AST for the PE shown earlier. In this context, path filter operators 
can then be defined in terms of manipulation of path expression 
representations. 
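
As a hedged illustration of the AST representation and of reinterpreting the operators, the sketch below builds the tree for PE(2, 7) and evaluates it in two ways: enumerating the represented paths, and computing the least cost by reading union as `min` and concatenation as `+`. The class names are hypothetical, the edge costs are invented, and closure is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:          # leaf: a single edge label
    label: str


@dataclass(frozen=True)
class Concat:        # internal node: left followed by right
    left: object
    right: object


@dataclass(frozen=True)
class Union:         # internal node: alternative sub-expressions
    left: object
    right: object


def paths(node):
    """Enumerate the paths (tuples of edge labels) an expression represents."""
    if isinstance(node, Edge):
        return {(node.label,)}
    if isinstance(node, Concat):
        return {l + r for l in paths(node.left) for r in paths(node.right)}
    return paths(node.left) | paths(node.right)


def shortest_cost(node, cost):
    """Same tree, with union read as min and concatenation as addition."""
    if isinstance(node, Edge):
        return cost[node.label]
    if isinstance(node, Concat):
        return shortest_cost(node.left, cost) + shortest_cost(node.right, cost)
    return min(shortest_cost(node.left, cost), shortest_cost(node.right, cost))


# PE(2, 7) = (2, 7, ((b . c . f) U (i . f))) from the running example.
pe_2_7 = Union(Concat(Concat(Edge("b"), Edge("c")), Edge("f")),
               Concat(Edge("i"), Edge("f")))
edge_cost = {"b": 1, "c": 1, "f": 2, "i": 3}  # hypothetical costs

all_paths = paths(pe_2_7)
best = shortest_cost(pe_2_7, edge_cost)
```

A path filter such as "avoid edge i" could be phrased in the same style, as a function that prunes subtrees of the AST rather than scanning strings.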


2.2 Algebraic Query Evaluation of Graph Pattern Matching 


It is well known that RDF admits a directed graph model. SPARQL [29] is 
the standard RDF query language, with its main query primitive being a graph 
pattern. A graph pattern matching query is usually evaluated with an algebraic 
query plan built from relational-like query operators. The graph patterns are 
compiled into an algebraic logical plan representation, which is generally a 
sequence of query operators with an implied execution ordering. For example, 
Jena ARQ [4,28] is a popular query engine that supports SPARQL queries, and it 
creates a SPARQL Syntax Expression (SSE) as an algebraic logical query plan. 
The last step in query evaluation is transforming the logical plan into a 
physical plan, which depends on the physical execution environment. 


3 Related Work 


[16,35] provide a good survey of graph query languages. For running queries that 
have both graph pattern matching and path computation components, in most 
cases, users have to use two different platforms. Those platforms that do allow 
both components mostly focus on finding shortest paths and not necessarily all 
paths. Platforms like Virtuoso [18], RDFPath [30], and Blazegraph [5] use 
property paths [12] supported by SPARQL 1.1 [25]. However, using property 
paths, it is only possible to know the specific sources and destinations, but 
not the exact paths. Also, users need to write a regular expression over the 
properties in the paths they are looking for, which requires them to know the 
exact properties in the path as well as to have some idea of the sequence of 
these properties. Gremlin [32], the query language for JanusGraph [8] and 
Neptune [2], also requires the predicates of the path to be specified in the 
query. Oracle’s PGQL [10,11,31,33] finds paths using general expressions over 
vertices and edges of the graph. The user needs to have knowledge of the 
sequence of edges in the paths being searched for in this case as well. 

Neo4j [9] and AgensGraph [1] use Cypher [7,20] as their query language. Cypher 
uses a fast bidirectional breadth-first search algorithm for optimizing path 
queries. However, this fast algorithm is used only in certain scenarios, such 
as finding a shortest path. When finding all paths, Cypher uses a much slower 
exhaustive depth-first search algorithm. Even for shortest path queries, the 
fast algorithm is used only if the predicates in the path query can be 
evaluated on the fly. For path queries with predicates that require examining 
the whole path before making a filtering decision, Cypher’s query evaluation 
falls back to exhaustive search. Cypher has another drawback: its shortest path 
algorithm produces incomplete results when the start and end nodes are the 
same. Such a scenario might occur when performing a shortestPath search where 
the sources and destinations are overlapping sets of nodes. 

Stardog [13] uses more traditional SPARQL operators for query evaluation. 
For any path query with start and end variable patterns, Stardog first finds all 
possible paths that match PQ, which is a regular expression similar to that used 
by property paths. The resulting set of paths is then joined with the end graph 
pattern, followed by the start graph pattern. This approach of applying the 
filter first and then joining with the source and destination patterns might be 
useful when the filter is highly restrictive. However, if the path query filter 
is not restrictive, it will produce a large result set, resulting in poor 
performance when joining with the start and end patterns. 

(a) Original query 

    PREFIX akt:<http://www.aktors.org/ontology/portal#> 
    PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> 

    SELECT * WHERE { 
      ?s1 rdf:type akt:Affiliated-Person . 
      ?s1 akt:full-name "Wendy E. Mackay" . 
      ?s akt:has-author ?s1 . 
      ?s2 akt:full-name "Irene Greif" . 
      ?s2 akt:has-affiliation ?d . 

      ?s ?pathVar ?d . 
    } 

(b) SSE produced by Jena after parsing and compiling 

    (prefix ((akt:http://www.aktors.org/ontology/portal#) 
             (rdf:http://www.w3.org/1999/02/22-rdf-syntax-ns#)) 
      (project (?s1 ?s) 
        (product 
          (join 
            (BGP 
              [triple ?s1 rdf:type akt:Affiliated-Person] 
              [triple ?s1 akt:full-name "Wendy E. Mackay"]) 
            (BGP 
              [triple ?s akt:has-author ?s1] 
            )) 
          (BGP 
            [triple ?s2 akt:full-name "Irene Greif"] 
            [triple ?s2 akt:has-affiliation ?d]))) 

Fig. 2. (a) An example path query in our implementation of the integrated platform. 
(b) The SSE produced by Jena’s parser and compiler 


4 Approach 


Introducing a new query class would typically require extending the query 
language and processing framework. However, we adopted an approach of 
introducing syntactic sugar that avoids the need for changing SPARQL’s query 
syntax. A second simplifying but reasonable strategy is the use of a fixed order 
between the graph pattern matching phase and the path computation phase. 
The rationale here is that in gpqs, pattern matching serves to compute the set 
of sources, destinations and/or intermediate nodes in constraints. In other words, 
the output of graph pattern matching can be seen as input to the path problem 
phase. Interpreting this in terms of query plans implies that the path 
computation and path filter operators will always be at the root of the tree 
for any gpqs query plan. In what follows, we elaborate on our realization of 
this strategy. 


4.1 Identifying GPQ Sub-Query Components in SPARQL* Queries 


Our syntactic sugar is based on adopting a pre-defined variable name, ?pathVar,
as the path operator. We acknowledge the risk of other users using this variable
in their queries, but assume this risk to be small. Since this is a legal
variable that is recognized by the graph pattern matching platform’s parser,
the unaltered parser can parse and compile path queries without failing due to
syntax issues. Here, we refer to SPARQL with our pre-defined variable ?pathVar
as SPARQL*.


Implementation Strategy: In this section, we describe the approach followed
to identify the source and destination variables using the pre-defined path
variable ?pathVar and then project them out from the graph patterns. The last
triple pattern in the example query in Fig. 2(a), (?s ?pathVar ?d), denotes the
path computation between all bindings of the variable ?s and the variable ?d.
The presence of the ?pathVar variable in the predicate position implies that it
is a path query. We must keep track of the positions of the source and
destination variables in the graph patterns and, after all the joins have taken
place, project out only the bindings of the source and destination variables.
These bindings then go into the path operator. To do this, we create the
required data structures to hold the position information of the source and
destination variables in the query. This information is required later, when we
create the final physical plan of the query.
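
As an illustration only, the detection step described above might be sketched as follows in Python. The triple patterns are reconstructed from the maps listed for Fig. 2(a) in this section, and the function name split_path_patterns and the tuple representation are assumptions made for the sketch; they are not part of Semstorm or Serpent.

```python
PATH_VAR = "?pathVar"  # the pre-defined variable that marks a path triple


def split_path_patterns(triples):
    """Separate ordinary triple patterns from path triples.

    A triple whose predicate is ?pathVar is treated as a request for path
    computation between its subject (source) and object (destination).
    """
    graph_patterns = []
    path_src_dst = {}
    for subject, predicate, obj in triples:
        if predicate == PATH_VAR:
            path_src_dst.setdefault(subject, []).append(obj)
        else:
            graph_patterns.append((subject, predicate, obj))
    return graph_patterns, path_src_dst


# Hypothetical reconstruction of the example query (Fig. 2(a) is not shown here).
query = [
    ("?s", "akt:has-author", "?s1"),
    ("?s1", "akt:full-name", '"Wendy E. Mackay"'),
    ("?s2", "akt:full-name", '"Irene Greif"'),
    ("?s2", "akt:has-affiliation", "?d"),
    ("?s", PATH_VAR, "?d"),
]

patterns, path_src_dst = split_path_patterns(query)
```

Because ?pathVar is an ordinary variable as far as the parser is concerned, a check of this kind can run on the parsed query without touching the grammar.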

For our proof-of-concept prototype, we integrated Semstorm [27] as the graph
pattern matching platform with Serpent [21,23] as the path query computation
platform. Semstorm uses the following two main data structures as its query
plan representation, holding the position information of the different triple
patterns in the submitted query.


— subjObjListMap holds the mapping between the subjects and the corre- 
sponding objects in the query. The subjObjListMap for the query in Fig. 2(a) 
would be 


subjObjListMap: {?s=[[?s1]], ?s1=[["Wendy E. Mackay"]],
?s2=[["Irene Greif"], [?d]]}


— subjPropListMap holds the mapping between the subjects and the prop- 
erties or predicates in the triple patterns in the query. The subjPropListMap 
of the query in Fig. 2(a) would be 


subjPropListMap: {?s=[has-author], ?s1=[full-name],
?s2=[full-name, has-affiliation]}


In addition to these, the following data structures have been added to
facilitate path computation and provide the required position information to
the path operator.


— pathSrcDst is a map that shows the mapping between the source variable 
and its corresponding destination variable. For the query in Fig. 2(a), the 
pathSrcDst would be 


pathSrcDst: {?s=[?d]} 


— srcMap contains the source variable in the key position and a list of
integers in the value position. The list of integers denotes the exact position
of the source variable in the subjObjListMap data structure. The srcMap of the
query in Fig. 2(a) would be


srcMap: {?s=[[0, -1]]}


— dstMap is similar to the srcMap, except that its key contains the destination
variable and the list of integers in its value position denotes the position of
the destination variable. The dstMap of the query in Fig. 2(a) would be


dstMap: {?d=[[2, 1]]} 


— cndMap is also the same as the srcMap and dstMap, except that it holds the
constraint information. For example, a query might restrict paths to those that
contain at least one akt:has-affiliation property or predicate. This triple
would then be part of the constraints, and its position would be captured in
the cndMap. The query in Fig. 2(a) is not a constrained query and hence its
cndMap would be empty.


The list of integers in the value position of the srcMap, dstMap and
cndMap denotes the position of the respective variables in subjObjListMap.
For example, {?s=[[0, -1]]} means that the variable ?s is in the first BGP of
subjObjListMap (indexing starts at 0), and -1 denotes that it is the subject of
the BGP. {?d=[[2, 1]]} means that the variable ?d is in the third BGP and is
the second object of that BGP. Sometimes these variables are also the join
variable between two graph patterns, so they can exist in multiple BGPs; the
value in the respective map is then a list of integer pairs, each identifying a
position of the variable in subjObjListMap.
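
The position encoding can be made concrete with a small Python sketch. It assumes that subjObjListMap preserves the order in which subjects were inserted and that each subject maps to one object list per property; the function name positions is hypothetical and is not taken from Semstorm.

```python
def positions(var, subj_obj_list_map):
    """Return [bgp_index, object_index] pairs for var.

    bgp_index is the position of the subject entry in the map (0-based);
    object_index is -1 when var is the subject itself, otherwise the index
    of the object list that contains var.
    """
    found = []
    for bgp_index, (subject, object_groups) in enumerate(subj_obj_list_map.items()):
        if subject == var:
            found.append([bgp_index, -1])
        for object_index, group in enumerate(object_groups):
            if var in group:
                found.append([bgp_index, object_index])
    return found


# subjObjListMap for the example query, as listed in the text.
subj_obj_list_map = {
    "?s": [["?s1"]],
    "?s1": [['"Wendy E. Mackay"']],
    "?s2": [['"Irene Greif"'], ["?d"]],
}

src_map = {"?s": positions("?s", subj_obj_list_map)}
dst_map = {"?d": positions("?d", subj_obj_list_map)}
# A join variable occurs in more than one BGP and yields several pairs.
join_positions = positions("?s1", subj_obj_list_map)
```

Under these assumptions the sketch reproduces the srcMap and dstMap values given above, and the join variable ?s1 yields two pairs, one as an object of the first BGP and one as the subject of the second.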


4.2 Logical Query Plan Transformation 


Our query planning approach is based on transforming the query plan produced
by the graph pattern matching engine. The intuition is that the subqueries,
i.e. the graph patterns defining the sets of sources, destinations, etc. for
path computation, can be translated to query plans in the usual manner. However,
the semantics of such queries will usually imply a cross-product of intermediate
results (since the subgraph patterns will be disconnected). We illustrate this
idea with the example query in Fig. 2(a) (but ignoring the last triple pattern
(?s ?pathVar ?d), which is our syntactic sugar for the path variable triple
pattern). Figure 2(b) shows the SSE created by Jena’s parser and compiler and
Fig. 3(a) shows the SSE as a tree. To achieve the correct query semantics, the
cross-product and projection operators have to be removed and query operators
associated with path computation introduced. The final operator in the plan
is a path filter operator (if path filtering constraints are specified; they
are absent in the example). The newly introduced components of the query plan
are enclosed in a dotted box in Fig. 3(b).


Fig. 3. Query plan transformation from graph pattern matching query to GPQ


Implementation Strategy: Our graph pattern matching platform, Semstorm
[27], is an RDF processing platform that targets Cloud processing and uses the
Apache Hadoop/Tez execution environment. Semstorm’s compiler builds
on Jena’s parser, using Jena’s SSE to create a Tez [3] DAG as the physical
query plan based on Semstorm’s query algebra. To achieve an equivalent physical
query plan transformation, similar to the logical plan transformation in Fig. 3,
new physical query operators have to be introduced. Since our physical execution
environment is Tez, the new physical operators are simply new Tez vertices.
The following new Tez vertex types were added to act as the physical query
operators.


— Annotator Vertex for Source, Destination and Constraint Variables. 
Semstorm is meant to run SELECT * WHERE queries and so, it propagates 
the data for all of the variables in the query. However, the path computation 
platform Serpent expects three lists of nodes that denote sources, destinations 
and constraints respectively. Hence, annotators were required to identify the 
source, destination or constraint variables and then allow only the bindings
for that variable to pass through, discarding the rest of the bindings. While
this might seem to be a less optimized method, it must be noted that bindings
to other variables cannot be discarded before all joins have completed, since
the source, destination or constraint variable may not always be the join
variable.

— PathComputer Vertex. This is the path operator which performs the path
computation. It takes the sources, destinations and constraints as input,
converts these into three String arrays as required by Serpent and then calls
the appropriate method in the Serpent platform. For every path query DAG,
this vertex will always be at the root.


Fig. 4. The Tez DAG representing the physical plan for the query shown in Fig. 2


Figure 4 shows the final Tez DAG that needs to be generated for the example
query in Fig. 2(a). The TypeScanner::?s vertex in the DAG identifies and
reads all triples that match the pattern {?s akt:has-author ?s1} from the
data file. Similarly, the other TypeScanner vertices read the respective matching
triples. The outputs of the TypeScanner::?s and TypeScanner::?s1 vertices go to
the Annotator(?s1):0 and Annotator(?s1):1 vertices respectively. These annotator
vertices identify the join variable and its position in the graph pattern. The
Packager::?s,?s1 vertex performs the actual join operation between the two graph
patterns and provides the joined output of the two input graph patterns.

If a simple pattern matching query is submitted to Semstorm, it adds a
Producter vertex that takes inputs from the Packager::?s,?s1 and
TypeScanner::?s2 vertices; the output of the Producter vertex goes into a
Flattener vertex, which writes the final query output to an output file on
disk. In our integrated version, we created a fork at this point: for a path
query, we do not add the Producter and Flattener vertices. After the joins, we
add the Annotator-Source:[[0,-1]] and Annotator-Destination:[[2,1]] vertices to
annotate the source and destination respectively. We also add the value from
the srcMap and dstMap to the respective vertex name. Since the destination
variable in our example is not involved in any joins, the destination annotator
vertex gets its input directly from the TypeScanner vertex that has the
destination variable. The PathComputer::Src:?s, Dst:?d vertex comes at the root
of the DAG, since it is executed last. While creating this vertex, information
about the source variable ?s and destination variable ?d is added to it using
the configuration payload. This DAG is submitted to the query execution
framework of Semstorm, which executes the DAG and produces the final path output.
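
The fork between the two plan shapes can be summarised with a small Python sketch. It models the DAG as plain (from, to) pairs and does not use the Tez API; the vertex names follow the prose above, while the exact edges (for instance that the join annotators feed the Packager) are an assumption read from that prose, and build_dag and roots are hypothetical helper names.

```python
def build_dag(is_path_query):
    """Return the edges of the physical plan for the example query."""
    # Shared part: scan, annotate the join variable, then join.
    edges = [
        ("TypeScanner::?s", "Annotator(?s1):0"),
        ("TypeScanner::?s1", "Annotator(?s1):1"),
        ("Annotator(?s1):0", "Packager::?s,?s1"),
        ("Annotator(?s1):1", "Packager::?s,?s1"),
    ]
    if is_path_query:
        # Path query: annotate source and destination, then compute paths.
        root = "PathComputer::Src:?s, Dst:?d"
        edges += [
            ("Packager::?s,?s1", "Annotator-Source:[[0,-1]]"),
            ("TypeScanner::?s2", "Annotator-Destination:[[2,1]]"),
            ("Annotator-Source:[[0,-1]]", root),
            ("Annotator-Destination:[[2,1]]", root),
        ]
    else:
        # Plain pattern matching: cross-product, then write out the result.
        edges += [
            ("Packager::?s,?s1", "Producter"),
            ("TypeScanner::?s2", "Producter"),
            ("Producter", "Flattener"),
        ]
    return edges


def roots(edges):
    """Vertices that receive input but feed no other vertex."""
    sources = {src for src, _ in edges}
    targets = {dst for _, dst in edges}
    return sorted(targets - sources)


path_plan_roots = roots(build_dag(True))
plain_plan_roots = roots(build_dag(False))
```

In this model the only vertex without an outgoing edge is the PathComputer for a path query and the Flattener for a plain pattern matching query, which matches the description of which vertex is executed last.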


4.3 Path Constraints 


Some path constraints can be evaluated by reinterpreting the union and
concatenation operations during the propagation algorithm (SOLVE), e.g. for
shortest paths. Others will be defined as manipulations over the path expression
produced by the unconstrained version of the problem, e.g. finding paths that
contain a given set of nodes (no order specified). Those manipulations will be
encapsulated in operators that are parent nodes of the PathComputer node in the
operator plan tree. The efficiency of such operations will depend on the nature
of the path expression representation, e.g. a binary encoded representation.
However, a detailed discussion of path constraints is outside the scope of this
paper.


Fig. 5. Size of source and destination sets for each query


5 Evaluation 


5.1 Test Setup 


The primary goal of our evaluation was to compare our integrated system with 
an existing platform on the following parameters. 


1. Query compilation time for our platform with and without the path
operator.

2. Performance, i.e., time taken to run the same queries. 

3. Completeness of results, i.e., whether the platform returns all paths expected. 

4. Expressiveness, i.e., what level of queries can be expressed in each platform. 


Dataset and Queries: Our queries were run on the BTC500M dataset [26] (size
0.5GB, 2.5 million triples). While formulating queries, we focused on finding
paths that are at least three hops long. The queries we ran vary from small sets
of sources and destinations to very large sets of sources and destinations. We
ran five small queries and five large queries, where small and large indicate
the size of the sets of sources and destinations shown in Fig. 5. In the charts,
Small Queries and Large Queries have been abbreviated to SQ and LQ respectively.
The same queries were modified to add constraints to run constrained query
experiments.

All the comparisons have been done with Stardog. We also considered Neo4j, 
but while trying to run queries using Cypher we found that all-paths queries on 
this dataset were running indefinitely and causing the Neo4j server to crash. We 
were able to run shortest path queries on Neo4j but that result is not included 
in this paper as finding shortest path was not an evaluation goal for this paper. 


Hardware Configuration: Evaluation was conducted on a single-node server
running HDFS, a privately owned RedHat Enterprise Server machine housed in
the University’s server lab. The server is equipped with an octa-core Xeon
x86_64 CPU (2.33 GHz), 40GB RAM, and two HDDs (3.6 TB and 445 GB). All results
have been averaged over five trials. In all the charts our platform has been
labelled as “Sem-Ser”.


5.2 Evaluation Results


Query Compilation Time Comparison: Figure 6 shows the time taken for 
query compilation on our platform for queries which have the path operator 
compared with the same queries without the path operator. The path operator 
does not have much effect on the query compilation time and in most cases the 
compilation time increased by less than one second. 


Fig. 6. Chart showing compilation time comparison


Performance Evaluation: When comparing the absolute time taken by our
platform with that of Stardog, we found that Stardog performed better in all
queries except for SQ1, LQ1 and LQ2. LQ2 timed out and produced only partial
results on Stardog; on our platform it took the longest time (5.5 min) and
produced the largest number of paths (0.8 million paths). This is mainly because
the graph patterns provided for the source and destination nodes were quite
general, leading to a large number of matching sources and destinations.
Consequently, there was a large number of paths connecting these nodes.


Completeness of Results: Figure 7(a) and (b) show the number of paths
identified by small and large queries respectively. LQ2 has been marked with an
asterisk since it did not finish in Stardog and hence all the charts have only
one value for this query. For all the queries, Stardog produced incomplete
results and also duplicate paths. This dataset has a lot of triples such as
(acm:58567 akt:has-publication-reference acm:58567). In this triple the subject
and the object are the same URI, acm:58567, and hence this is called a loop or
self-loop. The BTC dataset has a lot of such triples and Stardog does not
consider the self-loops in the paths it identifies. For example, suppose we have
an RDF graph consisting of the triples (A p1 A) (A p2 B) (B p3 B) and a path
query with A as source node and B as destination node. On execution of the path
query, Stardog will ignore the self-loops (A p1 A) and (B p3 B) and will output
only one path, (A p2 B). However, our platform will find four paths: (A p2 B),
(A p1 A p2 B), (A p2 B p3 B) and (A p1 A p2 B p3 B). This is the reason behind
Stardog mostly finding fewer paths than our platform.


Fig. 7. Chart showing comparison of number of paths identified


In some queries (SQ, LQ1), Stardog does find more paths. However, these
results contain duplicate paths. For example, although Stardog produces 40
paths for SQ, the number of unique paths is 6. Since there was a huge
mismatch between the number of paths found by our platform and Stardog, we
compared the time taken per path identified rather than the absolute time taken
for executing each query. Figure 8(a) and (b) show the time per path comparison
for the small and large queries respectively.
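
The self-loop example from the completeness discussion can be checked with a short Python enumeration. The sketch assumes that a path may use each triple at most once, which keeps the enumeration finite; the text does not spell out the exact path semantics of either platform, so this is an illustration rather than a description of Serpent or Stardog. The function name enumerate_paths is hypothetical.

```python
def enumerate_paths(triples, source, target, use_self_loops=True):
    """List paths from source to target, using each triple at most once."""
    paths = []

    def extend(node, used, path):
        # Record the path once the target is reached via at least one edge.
        if node == target and len(path) > 1:
            paths.append(tuple(path))
        for index, (s, p, o) in enumerate(triples):
            if index in used or s != node:
                continue
            if s == o and not use_self_loops:
                continue
            extend(o, used | {index}, path + [p, o])

    extend(source, frozenset(), [source])
    return paths


graph = [("A", "p1", "A"), ("A", "p2", "B"), ("B", "p3", "B")]

with_loops = enumerate_paths(graph, "A", "B")
without_loops = enumerate_paths(graph, "A", "B", use_self_loops=False)

# Counting unique paths guards against duplicates when comparing result sizes.
unique_with_loops = set(with_loops)
```

Under this assumption the enumeration yields the four paths listed in the text when self-loops are followed and the single path (A p2 B) when they are ignored.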


Expressiveness: All types of graph patterns can be expressed in Neo4j, Stardog
and our platform. However, Stardog does not support constraints such as
ALL, ANY, NONE. Figure 9 shows the comparison of the expressiveness of our
platform with that of Stardog and Neo4j. Neo4j has predicate functions (all,
any, exists, none, single) which can be used for the same purpose of filtering.
However, since we were not able to run all-paths queries on Neo4j, it was not
possible to compare constrained queries on our platform with those on Neo4j.
Figure 10 shows the time taken for constrained queries as compared to
unconstrained queries on our platform. All of the constrained queries
understandably take longer to complete query execution, since these queries
include an extra filtering step. For all of the small queries, the increase in
execution time is minimal, mainly because the size of the resulting set of paths
before filtering is also small. For the large queries, the increase in execution
time is more noticeable due to the larger size of the result set prior to
filtering.


Fig. 8. Chart showing the comparison of time taken per path identified


Query                    | Sem-Ser | Stardog | Neo4j
Pattern matching Query   | Y       | Y       | Y
Unconstrained Path Query | Y       | Y       | Y
Constrained Path Query   | Y       | X       | Y

Fig. 9. Table showing comparison of the level of expressiveness of our platform with
Neo4j and Stardog


Fig. 10. Execution Time of constrained queries vs unconstrained queries


6 Conclusion 


This paper presents an algebraic query evaluation strategy to evaluate
generalized path queries with declaratively defined source and destination
nodes. It also presents a general framework and steps to integrate any existing
graph pattern matching platform with a path computation platform. Lastly, it
describes an implementation of such an integrated platform and compares its
performance with that of popular platforms that can handle such generalized
path queries.


Abstract. SciGraph is a Linked Open Data graph published by
Springer Nature which contains information about conferences and con- 
ference publications. In this paper, we discuss how this dataset can be uti- 
lized to build a conference recommendation system, yielding a recall@10 
of up to 0.665, and a MAP of up to 0.540, generating recommenda- 
tions based on authors, abstracts, and keywords. Furthermore, we show 
how the dataset can be linked to WikiCfP to recommend upcoming
conferences.


Keywords: Recommender system - SciGraph - Scientific publications 


1 Introduction 


Bibliographic datasets form a major topic in the Linked Open Data Cloud¹,
accounting for a total of 12-13% of all datasets [15]. One of those datasets
is SciGraph², which is published by Springer Nature and is the successor of
Springer’s Linked Open Data Conference Portal [3], comprising 7.2M articles
and 240k books published by Springer Nature, and totaling 1B triples.

In this paper, we aim at exploiting SciGraph to provide users with recom- 
mendations of conferences to submit their publications to, utilizing SciGraph for 
information on past conferences and publications, and WikiCfP for information 
on upcoming conferences. 


1 https://lod-cloud.net/.
2 https://www.springernature.com/de/researchers/scigraph.


2 Related Work


The idea of building recommender systems for scholarly content goes back almost 
20 years [2,7]. More recently, Linked Open Data has been recognized as a valu- 
able source for building recommender systems. In particular, content-based rec- 
ommender systems, which focus on the items to be recommended and their 
interrelations, can benefit strongly from detailed descriptions of those items in 
open datasets [4,5]. 

Similar to the task in this paper, several approaches have been proposed 
for the recommendation of research papers (see [1] for a comprehensive survey). 
Although sharing the same domain, the setup is slightly different here — in our 
scenario, both the input data (i.e., authors, a textual abstract, keywords), and 
the prediction target (conferences instead of individual papers) are different. 


3 Approach 


3.1 Datasets 


The main dataset used to train the recommender system is SciGraph. For
training, we use publications from the years 2013-2015, whereas for evaluation,
publications from the year 2016 are used. In total, SciGraph contains 240,396
books; however, only a fraction of those correspond to the proceedings of a
single conference. Moreover, it contains 3,987,480 individual book chapters,
again, a fraction of which correspond to papers published at conferences.
Additionally, SciGraph provides a taxonomy of research topics, called Product
Market Codes (PMCs). In total, 1,465 of those PMCs are included in the hierarchy
and assigned to books. Only 89 of those PMCs are related to computer science.

The second dataset we use is WikiCfP³, a website which publishes calls for
papers. Since there is no downloadable version of the data (although the CC-BY- 
SA license allows for reusing the dataset), we built a crawler to create a dataset 
of CfPs, containing names, acronyms, dates, locations, and submission deadlines 
(which we consider mandatory attributes), as well as links to the conference page, 
the conference series, categorization in WikiCfP, and textual description (which 
we consider optional attributes). Overall, we crawled data for 65,714 CfPs in July 
2018. The crawled data was linked to SciGraph using string similarity between 
the conference names. This leads to 53.1% of the CfPs linked to SciGraph. 
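
The text does not say which string similarity measure or threshold was used for the linking. As a rough sketch under that caveat, the step could look as follows with the standard-library difflib.SequenceMatcher; the 0.9 threshold, the normalisation, the function names and the example conference names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a conference name and collapse whitespace."""
    return " ".join(name.lower().split())


def link_cfps(cfp_names, scigraph_names, threshold=0.9):
    """Map each CfP name to its most similar SciGraph conference name.

    A CfP stays unlinked when no candidate reaches the threshold.
    """
    links = {}
    for cfp in cfp_names:
        best_name, best_score = None, 0.0
        for candidate in scigraph_names:
            score = SequenceMatcher(
                None, normalize(cfp), normalize(candidate)
            ).ratio()
            if score > best_score:
                best_name, best_score = candidate, score
        if best_score >= threshold:
            links[cfp] = best_name
    return links


# Made-up names, used only to exercise the function.
cfps = ["International Semantic Web Conference", "Workshop on Imaginary Topics"]
series = [
    "international semantic web conference",
    "European Conference on Machine Learning",
]
links = link_cfps(cfps, series)
```

A stricter threshold links fewer CfPs but with fewer false matches; the reported 53.1% would depend on the measure and threshold actually chosen.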


3.2 Recommendation Techniques 


We use three main families of recommendation techniques, i.e., recommendations 
based on authors, abstracts, and keywords. Furthermore, we also use an ensemble 
strategy. Generally, the recommendation strategies either exploit some notion 
of similarity (e.g., recommending conferences which contain publications with 
similar abstracts), or model the problem as a machine learning problem (i.e., 
since we have 742 conference series in our training set, we train a multi-label 
classifier for 742 classes). 


3 http://www.wikicfp.com.


Author-based recommendations are computed based on the authors of an
application. Essentially, we count the number of papers per conference series
which share at least one author with the authors given in the abstract, and use
that count as a score.⁴
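
The author-based score can be sketched in a few lines of Python. Author names are compared as plain strings, in line with the footnote on missing disambiguation; the data layout, the function name author_scores and the example names are assumptions made for illustration.

```python
from collections import Counter


def author_scores(query_authors, papers, k=10):
    """Score each conference series by its number of papers that share
    at least one author with the query, and return the top k.

    papers is an iterable of (conference_series, author_names) pairs.
    """
    query = set(query_authors)
    scores = Counter()
    for series, authors in papers:
        if query & set(authors):
            scores[series] += 1
    return scores.most_common(k)


# Made-up training data.
papers = [
    ("Conf-A", ["A. Smith", "B. Lee"]),
    ("Conf-A", ["A. Smith"]),
    ("Conf-B", ["B. Lee"]),
    ("Conf-C", ["C. Wu"]),
]
ranking = author_scores(["A. Smith", "B. Lee"], papers)
```

Series without any shared author receive no score at all, which is why author-based lists can be shorter than 10 entries.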


Abstract-based recommendations compare the abstracts of publications in 
SciGraph with the abstract given by the user. Overall, we use two different 
approaches: the max strategy finds single publications with the highest abstract
similarity and proposes the corresponding conference, while the concat strategy 
concatenates all abstracts related to a conference to a virtual document, and 
compares the given abstract to those virtual documents. 
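
The difference between the two strategies can be illustrated with a small Python sketch. To keep it self-contained, it uses cosine similarity over raw term counts as a stand-in for the TF-IDF, LSA and embedding representations named in this section; the function names and the toy abstracts are assumptions.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts (a simplification of TF-IDF)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_max(query, papers):
    """Rank conferences by their single most similar abstract."""
    q = vectorize(query)
    best = defaultdict(float)
    for conference, abstract in papers:
        best[conference] = max(best[conference], cosine(q, vectorize(abstract)))
    return sorted(best, key=best.get, reverse=True)


def recommend_concat(query, papers):
    """Rank conferences by similarity to one virtual document each."""
    q = vectorize(query)
    virtual = defaultdict(Counter)
    for conference, abstract in papers:
        virtual[conference].update(vectorize(abstract))
    return sorted(virtual, key=lambda c: cosine(q, virtual[c]), reverse=True)


# Made-up abstracts.
papers = [
    ("Conf-A", "rdf graph query"),
    ("Conf-A", "deep learning images"),
    ("Conf-B", "graph query processing"),
    ("Conf-B", "sparql graph query"),
]
max_ranking = recommend_max("sparql graph query", papers)
concat_ranking = recommend_concat("sparql graph query", papers)
```

The max strategy rewards a conference for one very close paper, whereas concat rewards a conference whose papers are close to the query on aggregate, so the two can disagree on conferences with heterogeneous content.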

Different variants for generating recommendations are used. We utilize stan- 
dard TF-IDF, as well as TF-IDF based on word n-grams, LSA and LSA based 
on word n-grams [10], and pLSA [6]. Furthermore, we utilize similarity based on 
word embeddings, based on word2vec [11], GloVe [13], and FastText [8], using 
both pre-trained embeddings, as well as embeddings trained on the SciGraph 
collection of abstracts. While all those approaches are based on similarities, we 
also tried directly predicting the conferences using a convolutional neural net- 
work (CNN) approach, which takes the self-trained word2vec embeddings as 
representations for words, as discussed in [9]. 


Keyword-based recommendations are based on Product Market Codes in 
SciGraph. Such product market codes are defined by Springer Nature and resem- 
ble other categorization systems in computer science, such as the ACM comput- 
ing classification system. A second keyword-based model uses a script to identify 
Computer Science Ontology (CSO) [14] terms in the abstract entered by the user. 


4 Evaluation 


As sketched above, publication data from 2013-2015 were used as training data 
for the recommender system, whereas publications from 2016 were used for test- 
ing. For each publication in the test set, we try to predict the conference at 
which it has been published, and compare the results to the gold standard (i.e., 
the conference in which it has actually been published). We create 10
recommendations with each technique⁵, and report recall@10 and mean average
precision (MAP).
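
Since each test publication has exactly one gold conference, both measures take a simple form: recall@10 is the share of publications whose gold conference appears in the list, and average precision reduces to the reciprocal rank of the gold conference (0 if it is missing). A minimal Python sketch under that assumption, with hypothetical function names:

```python
def recall_at_k(ranked_lists, gold, k=10):
    """Share of test items whose gold label is among the top k."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_average_precision(ranked_lists, gold, k=10):
    """MAP with one relevant label per item: the mean of 1/rank."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        top = ranked[:k]
        if g in top:
            total += 1.0 / (top.index(g) + 1)
    return total / len(gold)


# Three made-up test publications with their recommendation lists.
ranked_lists = [["a", "b", "c"], ["b", "a"], ["c", "d"]]
gold = ["a", "a", "x"]

recall = recall_at_k(ranked_lists, gold)
mean_ap = mean_average_precision(ranked_lists, gold)
```

In the toy data the gold conference is found at rank 1, at rank 2 and not at all, giving a recall of 2/3 and a MAP of (1 + 0.5 + 0) / 3 = 0.5.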

Table 1 shows some basic statistics of the training and test set. In total,
the recommender system is trained on 742 conference series and 555,798 papers
written by 110,831 authors. As far as the abstracts are concerned, only a little
more than 10% of all the papers have an English language abstract. The average
length of an abstract is 136 words.


4 We do not disambiguate authors here, since no further clues for disambiguation,
such as organizations, or unique IDs, such as ORCID, are present in SciGraph.

5 The only exception are recommendations based on authors, which may create shorter
lists in cases where all authors altogether have published at fewer than 10
conferences contained in SciGraph.


120 A. Iana et al.


Table 1. Characteristics of the training and test set


                              | Training (2013-2015) | Test (2016) | Overlap
Distinct conference series ID | 742                  | 526         | 405
Distinct author names         | 110,831              | 53,862      | 20,529
Product market codes          | 155                  | 150         | 115
Papers                        | 555,798              | 200,502     | -
English abstracts             | 57,797               | 21,323      | -


Table 2. Results of the best performing individual recommendation techniques. For
each individual technique, we only report the results of the best performing strategy
(max or concat).


Method                                                              | Recall@10 | MAP
Author-based                                                        | 0.372     | 0.284
Abstract-based TF-IDF (concat)                                      | 0.461     | 0.237
Abstract-based n-gram TF-IDF (concat) w/ cosine similarity          | 0.490     | 0.270
Abstract-based n-gram TF-IDF (concat) w/ Multinomial Naive Bayes    | 0.494     | 0.273
Abstract-based LSA (concat)                                         | 0.461     | 0.237
Abstract-based n-gram LSA (concat)                                  | 0.490     | 0.270
Abstract-based pLSA (concat)                                        | 0.369     | 0.172
Abstract-based GloVe pre-trained (max)                              | 0.229     | 0.097
Abstract-based word2vec self-trained (max)                          | 0.346     | 0.154
Abstract-based word2vec plus CNN (concat)                           | 0.405     | 0.201
Abstract-based doc2vec (concat)                                     | 0.352     | 0.164
Keyword-based SciGraph market codes (max)                           | 0.665     | 0.522
Keyword-based CSO (max)                                             | 0.201     | 0.081
Ensemble TF-IDF & word2vec plus CNN (10)                            | 0.498     | 0.250
Ensemble TF-IDF & word2vec plus CNN & SciGraph market codes (10)    | 0.648     | 0.509
Ensemble TF-IDF & word2vec plus CNN & SciGraph market codes (100)   | 0.662     | 0.539
Ensemble TF-IDF & word2vec plus CNN & SciGraph market codes (1,000) | 0.661     | 0.540

Table 2 summarizes the results of the best performing models for recommen- 
dations based on authors, abstracts, and keywords. Generally, abstracts work 
better than authors, and keywords work better than abstracts. For abstracts, 
TF-IDF using single tokens yields a recall@10 of 0.461 and a MAP of 0.237. For 
using TF-IDF with n-grams, we explored different variants: we varied the upper 
limit for n between 2 and 5, and evaluated the approach with the 500k and 
1M most frequent n-grams, as well as with all n-grams. The best results were 
obtained when using the 1M most frequent n-grams of size 1 to 4, outperforming 
the standard TF-IDF approach. 


6 For a larger fraction of papers in SciGraph, no abstract is contained in the dataset.


In addition, we also evaluated a few ensemble setups. These were built by
combining recommendation lists of length 10, 100, and 1,000, given by different 
base recommenders, and using a logistic regression as a meta learner [16] to 
generate a recommendation list of length 10 as in the setups above. We can 
observe that combining two abstract-based techniques (TF-IDF and word2vec 
plus CNN, which were very diverse in their predictions), outperforms the two 
individual techniques in both recall@10 and MAP. 

Building ensembles incorporating SciGraph market codes yields no signifi- 
cantly better results than using keywords alone, demonstrating that those key- 
words are in fact the most suitable indicator for recommending conferences. 
Generally, extending the base recommendation lists beyond 100 elements does 
not change the results much, because conferences predicted on a position higher 
than 100 are unlikely to be considered in the final result list of size 10. 

The recall figures reported in Table 2 do not exceed 0.665, but this result
should be considered in a broader context. In total, only 77% of all conferences in 
the test set are also contained in the training set, i.e., we do not have any training 
signals for the remaining conferences. Since we can only use previous publications 
of proceedings for generating training features, the approaches discussed in this 
paper can only recommend conferences known from the training set, i.e., the 
maximum recall we could reach with these methods would be 0.815. 
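
The 77% figure follows directly from Table 1 (405 of the 526 test conference series also occur in the training set). The recall ceiling, by contrast, depends on how the test papers are distributed over conferences and cannot be recomputed from Table 1 alone. The Python sketch below shows one way such a ceiling could be computed, assuming it is the share of test papers whose conference series occurs in the training data; the function name and toy labels are hypothetical.

```python
def max_reachable_recall(test_labels, train_labels):
    """Share of test items whose label was seen during training.

    A recommender that can only output known labels cannot exceed this.
    """
    known = set(train_labels)
    hits = sum(1 for label in test_labels if label in known)
    return hits / len(test_labels)


# Share of test conference series that also occur in training (Table 1).
series_overlap = 405 / 526

# Toy example: one of four test papers belongs to an unseen conference.
toy_ceiling = max_reachable_recall(["a", "a", "b", "c"], ["a", "b"])
```

Because larger conferences contribute more papers, the paper-level ceiling can be higher than the series-level overlap, as is the case for the figures reported here.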

In general, we can see that keyword-based models are the best performing 
ones. However, they are also the least user-friendly ones, since product market 
codes are assigned by editors at Springer Nature (more recently, using automated 
tools [12]). While end users might be able to assign them at a decent quality, 
the actual recommendation quality with user-assigned keywords might actually 
be lower than the one based on editor-assigned product market codes. Another 
possible issue is that by selecting up to seven keywords out of 1,465, one could 
easily create pseudo-keys for conferences (i.e., each conference can be uniquely 
identified by its keywords), so overfitting might also be an issue for those models. 

Another observation we have made in our experiments is that there is a strong 
bias towards machine learning and neural networks related conferences. As the 
corpus is focused on computer science conferences, and the training dataset is 
from the past few years (an informal inspection of the data in SciGraph yielded 
that roughly half of the papers in the graph are related to artificial intelligence), 
this topic is over-represented in our training dataset. Hence, the system is likely 
to create more recommendations for such conferences. 


5 Conclusion 


In this paper, we have introduced a recommendation system for conferences, 
based on abstracts, authors, and keywords.⁷ The system can be used by authors
searching for upcoming conferences to publish at. The recommendations are 
computed based on SciGraph, with submission deadlines added from WikiCfP. 


7 A prototype is available at http://confrec.dws.uni-mannheim.de/.


We have observed that the best signal for creating recommendations comes from
keywords, in particular market codes in SciGraph, which, however, are often not
easy for lay users to select. With those keywords, a recall@10 of up to 0.665
and a MAP of up to 0.522 can be reached. Recommendations based on authors
(recall@10 of 0.372 and MAP of 0.284) and abstracts (recall@10 up to 0.494,
MAP up to 0.273) are clearly inferior, where the best results for the latter
are obtained with TF-IDF based on word n-grams. Moreover, the good results
obtained with vector space embeddings pre-trained on other text categories (e.g.,
news articles or Wikipedia texts) could not be reproduced on a target corpus of
abstracts of scientific texts from various research fields.


Abstract. There is great potential in creative industries, such as
architecture and video game design, for re-using and re-purposing of digital
content. Paintings, archival footage, documentaries, movies, reviews or 
catalogues, and various other forms of artwork can serve as sources of 
inspiration and design direction towards innovative designs and new con- 
cepts. In this paper, we present V4Ann, an ontology-based framework 
for semantically representing, aggregating and combining annotations 
(atoms) coming from visual and textual analysis of digital content. The 
aim is to structure and link data in such a way so as to facilitate the sys- 
tematic process, integration and organisation of information and estab- 
lish innovative value chains and end-user applications. The framework is 
part of the V4Design platform that aims to re-use and re-purpose existing 
heterogeneous multimedia content by semantically enriching and trans- 
forming assets into a 3D representation, so as to inspire and support the 
design, architecture, as well as 3D and VR game industries. 


Keywords: Annotations - Ontologies - Reasoning - Semantic 
enrichment - Multimodal data 


1 Introduction 


Vast amounts of multimedia content is being produced, archived and digitised, 
resulting in great troves of data of interest. Examples include user-generated 
content, such as images, videos, text and audio posted by users on social media 
and wikis, or content provided through official publishers and distributors, such 
as digital libraries, organisations and online museums. This digital content can 
serve as a valuable source of inspiration to the cultural and creative industries 
to produce new assets or to enhance and (re-)use the already existing ones. 
However, the re-use and re-purposing of digital content is mainly realised
based on individual designers' skills and a variety of non-interlinked heterogeneous
tools. As a result, the content remains largely under-exploited, despite its
great potential for re-use and re-purposing, due to the lack of appropriate solutions
for its retrieval and integration into the design process. For example, existing
heterogeneous multimedia content, such as video and images of buildings and
objects, can be collected and transformed (e.g. into 3D models¹), so as to inspire
and support the creation of new content in creative industries. One of the main
challenges in this area is to maximise the potential for re-purposing of digital 
content through the development of innovative technologies to systematically 
analyse, combine, link and foster searchability and reusability of heterogeneous 
multimedia content in different contexts. 

In this paper we describe V4Ann, an ontology-based framework for capturing 
and interlinking digital assets and their annotations at two levels: (a) content
analysis level, during which visual and textual content is analysed to extract 
labels, called atoms; and (b) retrieval and repurposing level, where the assets (e.g. 
3D models and images) are interlinked and contextually enriched to facilitate 
their discovery. At the content analysis level, V4Ann provides the conceptual 
structures to capture and interlink multimedia analysis results on digital content, 
such as video, image and text. During retrieval and repurposing, V4Ann provides
practical retrieval capabilities, allowing users, e.g. game designers, to search for
assets relevant to their needs. V4Ann is part of the V4Design platform², enriching
multimedia processing with a semantic annotation layer.

The contribution of our research can be summarised in the following: 


— We describe a resource annotation model that implements the W3C standard 
for defining annotations (Web Annotation Data Model [17]). 

— We define a core set of rules that perform valid inferences for annotation 
propagation and interlinking, as well as for validity checking. 

— We propose an atom similarity metric along with a searching algorithm for 
keyword-based digital asset retrieval. 


The rest of the paper is structured as follows: Sect. 2 presents related work. 
Section 3 gives an overview of the framework and presents our motivation. In 
Sect. 4 we describe the basic concepts of the V4Ann annotation model, while
in Sect. 5 we elaborate on the inference and validation capabilities. Section 6
describes the atom similarity metric and the searching functionality. In Sect. 7
we present evaluation results and, finally, in Sect. 8 we conclude our work.


2 Related Work 


Annotations are typically used to convey information about a resource or associations
between resources. Simple examples include a comment or tag on a
single web page, image or video, or a blog post about a news article. In 2017, the
Web Annotation Data Model (WADM) [17] became the W3C recommendation
for defining annotations. It provides an extensible, interoperable framework for
expressing annotations, such that they can easily be shared between platforms³.


1 https://pro.europeana.eu/project/3d-content-in-europeana.
2 https://v4design.eu/.
3 https://www.w3.org/TR/annotation-vocab/.

In the domain of digital libraries, the Europeana Data Model (EDM) [4]
adopts an open and scalable approach that can accommodate the range and level
of details of particular standards, such as LIDO for museums, EAD for archives
or METS for digital libraries. EDM is not built on any particular standard;
however, it is conceptually in line with WADM and the ORE⁴ initiative.

The Open Provenance Model (OPM) [11] makes it possible to specify what caused
“things” to be, i.e., how “things” depended on others and resulted in specific
states. In essence, it allows provenance information to be exchanged between
systems, by means of a compatibility layer based on a shared provenance model.
OPM predates PROV-O [9] and takes a very similar approach to modelling provenance
by relating agents, artifacts and processes; the concepts of OPM are
covered by equivalent PROV-O concepts. PAV [3] extends PROV-O and specifies
Provenance, Authoring and Versioning information.

The Dublin Core metadata (DCMI) standard⁵ is a simple yet effective
general-purpose set of 15 elements for describing a wide range of networked
resources. Although DCMI favors document-like objects, it can be applied to
other resources as well. The SKOS Core Vocabulary [10] is a model for expressing
the basic structure and content of concept schemes. Specifically for multimedia,
the Ontology for Media Resources⁶ was developed by the W3C Media Annotations
Working Group to identify a minimum set of core properties to describe
and retrieve information about media resources. VidOnt [18] provides a formally
grounded core reference ontology for video representation. Several attempts have
been made to map the XML Schema of MPEG-7 to RDFS and OWL [19], and
X3D to OWL (OntologyX3D [6] and the 3D Modeling Ontology, 3DMO⁷).

V4Ann aims to serve as the semantic annotation layer of multimedia processing
results for fostering data exchange among analysis services and for human
consumption. In order to promote interoperability and extensibility, it imple- 
ments the WADM pattern, introducing the concept of atoms and providing 
several annotation entities and properties. In contrast to existing models that 
mostly focus on metadata defined by data providers and curators, V4Ann aims 
to capture content analysis results (e.g. visual and textual analysis), serving as 
a semantic middleware for metadata exchange. For example, EDM views refer 
to digital representations, whereas in V4Ann a view represents an atom-based 
interpretation of a content analysis procedure, e.g. aesthetics extraction. How- 
ever, V4Ann provides alignments to conceptual structures of existing models, 
such as the EDM, ORE and SKOS (see Sect. 4 for more details). 

As far as semantic enrichment and retrieval are concerned, recent advances 
in machine learning and especially deep learning have provided us with tools 
like word representations (e.g. word2vec [20] and Glove [14]), which led to the 
development of more recent and powerful analysis models [15]. In addition, 
several approaches have been proposed for question answering over Semantic 


4 http://www.openarchives.org/ore/1.0/vocabulary.
5 http://dublincore.org/documents/dces/.
6 http://www.w3.org/TR/mediaont-10.
7 http://3dontology.org.

Web knowledge bases and Linked Data. Most of them generate one or more
queries, while others opt for graph-based approaches to mitigate the rigidness 
often entailed in formulating appropriate SPARQL queries. Examples include 
EARL [5] and VoxEL [16]. V4Ann aims to provide a practical context enrich- 
ment framework to facilitate basic asset discovery, rather than proposing a fully 
fledged question answering framework. To this end, it introduces the notions of 
atom similarity and local contexts.

Fig. 1. The position of V4Ann in the integrated V4Design platform.


3 Key Concepts and Motivation 


In a world where visual and textual data are in abundance, creative industries
need to re-use and re-purpose them so as to remain competitive with other industries
and to open up new economic prospects for society and creative work. V4Design is
an H2020 project that aims at exploiting state-of-the-art digital content analy- 
sis techniques to generate 3D models, extract aesthetic and stylistic information 
from paintings and videos, localise buildings and objects of interest within visual 
content, and integrate it with textual information so as to inspire and support 
the design, architecture, as well as 3D and VR game industries. 

V4Ann aims to enrich V4Design with a semantic annotation layer. On the one
hand, V4Ann acts as the semantic middleware, capturing, interlinking and serving
analysis results to multimedia analysis services. On the other hand, it provides
the semantic atom-based query infrastructure to retrieve generated assets.
The conceptual architecture of V4Design, along with the position of V4Ann, is
depicted in Fig. 1. All in all, V4Ann aims to address the following challenges:

— Annotation propagation and linking: In a multimodal content analysis setting,
like in V4Design, a single media type can be analysed by multiple technologies.
For example, an image can be used for extracting building masks, as
well as for aesthetics (style) extraction. Also, in many cases, there are inter- 
dependencies among the components, e.g. 3D model reconstruction needs as 
input video frame masks extracted by building localisation. It is important 
to have an efficient and interoperable way to represent, exchange and further 
link metadata, both structurally and semantically. 

— Context-aware retrieval: V4Design aims to create new multimedia content
that can be integrated in existing architecture and video game design platforms,
such as Unity⁸ and Rhino⁹. Therefore, there is a need for practical
and efficient retrieval mechanisms on top of the multimodal annotations,
for example to allow users to search for assets with certain styles or with
advanced contextual filters, such as “castles near lakes”.


In order to address the aforementioned challenges, V4Ann capitalises on and
combines existing Semantic Web standards for resource annotation and interlinking,
inference and validation. More precisely, the WADM model is used as
the core resource annotation pattern, combined with existing structured ontologies
and schemata (Sect. 4). SPIN rules [7] and SHACL shapes [8] are used to
derive additional relations among the annotated resources and for validating
the generated knowledge graphs (Sect. 5). Finally, keyword-based context-aware
retrieval is facilitated to retrieve assets (Sect. 6).


4 V4Ann Annotation Model 


Figure 2 illustrates the upper-level concepts of the V4Ann annotation model. 
The conceptual model revolves around the notions of annotations, media types, 
views and atoms. Annotations serve as resource containers, implementing the 
annotation pattern of WADM. Each annotation associates a media type (image, 
video, text, 3D model) with a view, which encapsulates a set of atoms. Each view 
defines one or more atoms, e.g. entities, tags, styles, etc. that are derived from 
multimedia content analysis. These atoms describe: (a) Aesthetics, i.e. architec- 
tural styles and creators that are extracted from images and videos; (b) Object 
and building types that are recognised in images and videos; (c) Named entities 
and concepts that are extracted from textual descriptions, e.g. image captions; 
(d) images and video frames used to reconstruct a 3D model. All atoms derived 
by aesthetics, localisation and text analysis are disambiguated, i.e. they are 
already mapped to WordNet, BabelNet or DBpedia resources by the content 
analysis services. Figure 2 also presents SKOS mappings to the ORE specification,
as well as subclass and subproperty relations to WADM and EDM. In the
following we describe each key concept in detail.


8 https://unity3d.com/.
9 https://www.rhino3d.com/.

Fig. 2. The core concepts of the V4Ann annotation model defined as specialisations
of WADM (oa namespace). Mappings to other models are also depicted, such as to the
Europeana Data Model (EDM) and the Object Reuse and Exchange (ORE) initiative.


4.1 Annotation Resources 


Four domain-specific annotation classes are defined for attaching atom views to
media types¹⁰: LocalisationAnnotation, TextualAnnotation, AestheticsAnnotation
and 3DModelAnnotation. According to the WADM specification,
an annotation has 0 or more bodies (oa:hasBody), which encapsulate descriptive
information, and 1 or more targets (oa:hasTarget) that the bodies describe.
V4Ann defines two subproperties to restrict the values of these properties, associating
the targets (i.e. the media types) with view atoms. Intuitively, a V4Ann
annotation has a context that describes a media type using views. In terms of
OWL 2 semantics, the hasContext ($\sqsubseteq$ oa:hasBody) property takes as values
only instances of the View class and the describes ($\sqsubseteq$ oa:hasTarget) property
has at least one MediaType value. The Annotation class is defined as:¹¹


$$\mathrm{Annotation} \sqsubseteq \mathrm{oa{:}Annotation} \sqcap \exists\,\mathrm{describes}.\mathrm{MediaType} \sqcap \forall\,\mathrm{hasContext}.\mathrm{View} \qquad (1)$$


4.2 Media Types 


In order to define the targets of annotations (describes property assertions), 
V4Ann provides the MediaType upper-level class. There are four media types for
annotations: Video, Text, Image (with the subclasses Mask $\sqsubseteq$ Image and Texture $\sqsubseteq$ Image) and 3DModel.
Each media type can be associated with additional descriptive information, such 


10 In the rest of the paper, we omit the v4d namespace.
11 We use Description Logics [2] to represent the semantics.

as the source of the asset (e.g. the URL), license information, date of retrieval,
etc. Intuitively, each media type resource represents a single multimedia asset 
for which a set of annotation atoms needs to be captured. 


4.3 Views and Atoms 


Views are container classes that encapsulate the annotation metadata (atoms)
and they are used in hasContext property assertions. Each media type has a different
view. For example, the atoms of spatio-temporal building (BuildingView
$\sqsubseteq$ View) and object localisation (ObjectView $\sqsubseteq$ View) in images and videos
specify their type, i.e. whether the image or video contains a building, an object
or a painting. The semantics of OWL 2 allows us to define useful complex class
descriptions to specify further dependencies, as described below. It should be
noted that content analysis is not part of the V4Ann framework. As described
in Sect. 3, V4Ann aims to semantically capture the results of content analysis,
which is part of the overall V4Design platform [1].


Aesthetics. Aesthetics extraction refers to the categorisation of the aesthetics
of paintings and images that contain architecture objects and buildings based on
their style (e.g. impressionism, cubism and expressionism), the creator (mainly
for paintings) and the emotion that they evoke in the viewer. Two properties are
defined for creators (v4d:creator $\equiv$ schema:creator) and styles (v4d:style),
whose domain is the v4d:AestheticsView class.


$$\mathrm{AestheticsAnnotation} \sqsubseteq \mathrm{oa{:}Annotation} \sqcap \exists\,\mathrm{describes}.(\mathrm{Image} \sqcup \mathrm{Video}) \sqcap \forall\,\mathrm{hasContext}.\mathrm{AestheticsView} \qquad (2)$$

$$\mathrm{AestheticsView} \sqsubseteq \forall\,\mathrm{creator}.\mathrm{Creator} \sqcap \forall\,\mathrm{style}.\mathrm{Style} \qquad (3)$$


The Creator and Style classes serve as container classes, allowing the capturing
of data-specific properties, such as the classification confidence, as well as
links to DBpedia and BabelNet. Figure 3 presents an aesthetics annotation
example (left part). The image depicts the Tholos of Delphi, which has been
given the atom (style) “Greek Architecture”.


Object and Building Localisation. The localisation of buildings and interior objects
in art and architecture-related movies, documentaries and art images
aims to extract content that can be re-purposed and re-used in a meaningful and
innovative way. Examples include buses, trains, as well as statues, buildings, etc.

The extracted atoms (labels) are mapped to the V4Ann annotation model 
in terms of generated masks and tags. In videos, the results are also associated 
with frame(s) to capture the temporal aspects of localisation. 


$$\mathrm{LocalisationAnnotation} \sqsubseteq \mathrm{oa{:}Annotation} \sqcap \exists\,\mathrm{describes}.(\mathrm{Image} \sqcup \mathrm{Video}) \sqcap \forall\,\mathrm{hasContext}.\mathrm{LocalisationView} \qquad (4)$$

$$\mathrm{LocalisationView} \sqsubseteq \exists\,\mathrm{hasTag}.\mathrm{Tag} \sqcap \forall\,\mathrm{hasFrame}.\mathrm{integer} \qquad (5)$$

Text Analysis. Text analysis provides the atoms that are derived from textual
content. For example, in addition to annotating images with buildings
and objects, the assets are further enriched with named entities and concepts
extracted from captions, titles and descriptions. V4Ann captures these atoms
and associates them with the media type (video or image) that the textual content
is relevant to through instantiations of the TextAnalysisView class. Example
atoms include name, title, date, creator, designer, artist, location, etc.,
defined as subproperties of Tag.


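$$\mathrm{TextAnnotation} \sqsubseteq \mathrm{oa{:}Annotation} \sqcap \exists\,\mathrm{describes}.(\mathrm{Image} \sqcup \mathrm{Video}) \sqcap \forall\,\mathrm{hasContext}.\mathrm{TextView} \qquad (6)$$

$$\mathrm{TextView} \sqsubseteq \exists\,\mathrm{hasTag}.\mathrm{Tag} \qquad (7)$$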


3D Reconstruction. 3D reconstruction converts input video and images into 
3D point clouds and meshes. Apart from the actual 3D object, this process 
generates a variety of metadata, such as the number of point clouds, number of 
faces, textures, etc. The most important atom is the source of reconstruction, 
i.e. the video or the images the 3D model has been extracted from. 


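$$\mathrm{3DModelAnnotation} \sqsubseteq \mathrm{oa{:}Annotation} \sqcap \exists\,\mathrm{describes}.\mathrm{3DModel} \sqcap \forall\,\mathrm{hasContext}.\mathrm{3DModelView} \qquad (8)$$

$$\mathrm{3DModelView} \sqsubseteq \exists\,\mathrm{hasSource}.(\mathrm{Image} \sqcup \mathrm{Video}) \qquad (9)$$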


A 3D annotation example is depicted in Fig. 3 (right part). The annotation
of the 3D model of Tholos is associated (image $\sqsubseteq$ hasSource) with the images
that have been used for the reconstruction. It is assumed that the example image 
for aesthetics extraction is part of the set, demonstrating the way multimodal 
analysis results are interlinked. As we describe in the next section, these links 
are used to materialise additional relationships in the form of inference rules. 


5 Inference and Validation 


5.1 Implicit Relationships 


Additional inferences are derived by combining native OWL 2 RL reasoning and 
custom rules. The former is based on the OWL 2 RL profile semantics (OWL 
2 RL/RDF rules [12]), which is implemented by state-of-the-art triple stores, 
such as GraphDB. However, the expressivity of OWL 2 is limited. For example, only
instances connected in a tree-like manner can be modelled [13]. V4Ann implements
domain rules on top of the graphs to express richer relations. SPARQL-based
CONSTRUCT graph patterns are used that identify the valid inferences that
can be made on the annotation graphs. It is beyond the scope of the paper to
include an extensive coverage of relevant reasoning capabilities. In the following
we present the concept of atom propagation that illustrates the principal idea.

Fig. 3. Example of atom propagation. The dashed arrow illustrates the enrichment of
the 3D annotation resource with the aesthetics style derived from visual analysis.


Since V4Ann follows a standard-based annotation pattern, additional rela- 
tions can be further derived. For example, the aesthetics atoms extracted from 
video frames can be used to annotate the 3D models that have been reconstructed
using those frames. The principal idea is that atoms can be propagated
among one or more views, provided that their annotations are associated.

Figure 3 illustrates atom propagation between an aesthetics annotation and a 3D model
annotation. The two annotations are connected at the view level, since the
aesthetics annotation describes an image (img_1) that has been used to generate
the 3D model of Tholos (id 3d_1). In this case, the view that describes the
3D model inherits the atom (style) of the image (Greek Architecture). The
corresponding SPARQL graph pattern is given below.


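CONSTRUCT {
  ?view :style ?atom .
} WHERE {
  # aesthetics annotation of an image, carrying a style atom
  ?a1 a :AestheticsAnnotation ;
      :describes ?img ; :hasContext [ :style ?atom ] .
  # 3D model view reconstructed from the same image
  ?a2 a :3DModelAnnotation ; :hasContext ?view .
  ?view :image ?img .
}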
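
The propagation rule can also be traced outside a triple store. The following is a minimal Python sketch, not the V4Ann implementation: it evaluates the same graph pattern over an in-memory set of triples. All resource identifiers (ann_1, view_a, view_3d, and so on) and the helper names are hypothetical and chosen only to mirror the Tholos example.

```python
# Toy triple set mirroring the Tholos example; identifiers are illustrative only.
TRIPLES = {
    ("ann_1", "rdf:type", "AestheticsAnnotation"),
    ("ann_1", "describes", "img_1"),
    ("ann_1", "hasContext", "view_a"),
    ("view_a", "style", "Greek Architecture"),
    ("ann_2", "rdf:type", "3DModelAnnotation"),
    ("ann_2", "hasContext", "view_3d"),
    ("view_3d", "image", "img_1"),
}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def instances(triples, cls):
    """All subjects typed with the given class."""
    return {s for (s, p, o) in triples if p == "rdf:type" and o == cls}


def propagate_styles(triples):
    """Return the new (view, 'style', atom) triples implied by the rule."""
    inferred = set()
    for a1 in instances(triples, "AestheticsAnnotation"):
        images = objects(triples, a1, "describes")
        styles = {atom
                  for ctx in objects(triples, a1, "hasContext")
                  for atom in objects(triples, ctx, "style")}
        for a2 in instances(triples, "3DModelAnnotation"):
            for view in objects(triples, a2, "hasContext"):
                # The 3D view must reference an image the aesthetics annotation describes.
                if images & objects(triples, view, "image"):
                    inferred |= {(view, "style", atom) for atom in styles}
    return inferred - triples


new_triples = propagate_styles(TRIPLES)
print(new_triples)
```

In this toy graph the only inferred triple attaches the style atom of img_1 to the view of the 3D model, which corresponds to the dashed arrow in Fig. 3.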


5.2 Validation and Consistency Checking 


The validation process checks the consistency, structural and syntactic quality
of the metadata. We use both native ontology consistency checking (e.g. OWL
2 DL reasoning) and custom SHACL validation rules, following the closed-world
paradigm. The former handles validation taking into account the semantics at the
terminological level (TBox), e.g. checking class disjointness. The latter detects
constraint violations, e.g. missing values and cardinality violations. An example
SHACL shape is given below that represents the constraint that all 3D model views
should include references to the atoms (images) used for the 3D reconstruction.


v4d:3DModelView
  rdf:type sh:NodeShape ;
  sh:property [
    rdf:type sh:PropertyShape ; sh:path v4d:image ;
    sh:class v4d:MediaType ; sh:minCount 1 ;
    sh:name "one or more images" ; sh:nodeKind sh:IRI ;
  ] .
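
To make the closed-world reading of the cardinality constraint concrete, the sketch below re-implements only the sh:minCount part of the shape over a toy set of triples. It is an illustration under simplifying assumptions, not a SHACL engine: the sh:class and sh:nodeKind checks are ignored, and the function name and resource identifiers are hypothetical.

```python
def check_min_count(triples, target_class, path, min_count=1):
    """Return focus nodes of target_class with fewer than min_count values for path.

    Closed-world reading: a value that is not asserted in the graph is treated
    as missing, so the node is reported as a violation.
    """
    focus_nodes = {s for (s, p, o) in triples if p == "rdf:type" and o == target_class}
    violations = []
    for node in sorted(focus_nodes):
        values = {o for (s, p, o) in triples if s == node and p == path}
        if len(values) < min_count:
            violations.append(node)
    return violations


# Illustrative graph: the second view has no image reference.
graph = {
    ("view_3d_1", "rdf:type", "3DModelView"),
    ("view_3d_1", "image", "img_1"),
    ("view_3d_2", "rdf:type", "3DModelView"),
}

violations = check_min_count(graph, "3DModelView", "image")
print(violations)  # ['view_3d_2']
```

An OWL reasoner working under the open-world assumption would not flag view_3d_2, since the missing image could simply be unknown; this is why the cardinality check is delegated to SHACL.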


6 Context-Based Asset Retrieval 


In Sect. 4 we described the process of creating the V4Ann annotation graphs,
which involves the representation and further interlinking (e.g. through annota- 
tion propagation) of media type atoms. In this section we describe the approach 
of V4Ann towards enabling keyword-based context-aware retrieval of assets, cap- 
italising on the concept of local context. 


Definition 1. The local context $l_t$ of an atom $t$ is defined as the tuple
$(t, r_t, he_t, ho_t)$, where $r_t$ is the set of conceptually relevant terms, $he_t$ is the set
of hypernyms and $ho_t$ is the set of hyponyms of $t$.


Intuitively, a local context of an atom constitutes an enriched, pre- 
constructed semantic signature of this atom, taking into account conceptual and 
lexical relations from existing semantic networks and datasets, such as WordNet, 
BabelNet and ConceptNet (Fig. 4). In the case of hypernyms and hyponyms, we 
use the threshold h to specify the maximum level of relevant atoms. All in all, 
the retrieval mechanism of V4Ann aims to match incoming local contexts of 
query atoms (keywords) against local contexts of annotation atoms. 
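
A local context can be pictured as a small record built once per atom. The sketch below is a hypothetical illustration: the lexical data is a hand-written toy dictionary standing in for WordNet, BabelNet and ConceptNet lookups, and the names LocalContext, expand and build_local_context are assumptions, not part of V4Ann. It shows how the threshold h bounds the number of hypernym and hyponym levels that are collected.

```python
from dataclasses import dataclass

# Toy lexical relations; a real system would query WordNet, BabelNet or ConceptNet.
HYPERNYM = {"castle": ["fortification"], "fortification": ["structure"],
            "structure": ["artifact"]}
HYPONYM = {"castle": ["keep"]}
RELATED = {"castle": ["moat", "tower"]}


@dataclass(frozen=True)
class LocalContext:
    atom: str
    relevant: frozenset   # r_t
    hypernyms: frozenset  # he_t
    hyponyms: frozenset   # ho_t


def expand(atom, relation, h):
    """Collect the terms reachable from atom via relation in at most h steps."""
    seen, frontier = set(), {atom}
    for _ in range(h):
        frontier = {n for t in frontier for n in relation.get(t, [])} - seen - {atom}
        seen |= frontier
    return frozenset(seen)


def build_local_context(atom, h=1):
    return LocalContext(atom,
                        frozenset(RELATED.get(atom, [])),
                        expand(atom, HYPERNYM, h),
                        expand(atom, HYPONYM, h))


ctx1 = build_local_context("castle", h=1)
ctx3 = build_local_context("castle", h=3)
print(sorted(ctx1.hypernyms))  # ['fortification']
print(sorted(ctx3.hypernyms))  # ['artifact', 'fortification', 'structure']
```

With h = 3 the context already contains a very generic term ("artifact" in this toy data), which illustrates why larger thresholds can dilute the semantic signature of an atom.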


6.1 The AH Metric 


The AH metric represents the similarity of two atoms taking into account their 
local context. It depends on a term similarity function S, and on a set F of local 
context filters. In the following, we assume that S(A, B) denotes the similarity 
of two atoms A and B, with respect to the function S, and that $S(A, B) \in [0..1]$,
with 1 denoting absolute match. We use the notation $A \sim_f B$ to denote that A
matches B, with respect to one of the following filters f:


1. exact (e). The two atoms should have either the same URI, or they should
be equivalent concepts, that is, $A \sim_e B \Leftrightarrow A = B \lor A \equiv B$.


134 G. Meditskos et al. 


[Figure 4(b) node labels: Gendarmenmarkt, square, Germany, Kulturdenkmal]


Fig. 4. (a) Generic local context of an atom: relevant atoms are extracted from ConceptNet
and BabelNet properties, hypernyms stem from WordNet and IS-A BabelNet
relationships, hyponyms stem from WordNet; (b) example local context for “Gendarmenmarkt”.


2. plugin (p). The atom B should belong to the set of hypernyms of A ($he_A$) or
to the set of relevant concepts of A ($r_A$), that is, $A \sim_p B \Leftrightarrow B \in he_A \lor B \in r_A$.
3. subsume (su). The atom B should belong to the set of the hyponyms of A,
that is, $A \sim_{su} B \Leftrightarrow B \in ho_A$.


We generalize the $A \sim_f B$ relation (A matches B under the filter f) to a set of
filters F and we define that the atom A matches the atom B, with respect to a filter
set F, if and only if there is at least one filter f in F such that $A \sim_f B$, that is:

$$A \sim_F B \Leftrightarrow \exists f \in F : A \sim_f B.$$
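
The three filters and their generalisation to a filter set can be sketched as simple membership tests over local contexts. The code below is an illustrative sketch, assuming that local contexts are available as dictionaries with the keys "r", "he" and "ho"; the toy data, the EQUIVALENT set and all function names are hypothetical.

```python
# Toy local contexts keyed by atom: r (relevant), he (hypernyms), ho (hyponyms).
CONTEXTS = {
    "castle": {"r": {"moat"}, "he": {"fortification"}, "ho": {"keep"}},
}
# Hypothetical equivalence links between atoms (e.g. owl:equivalentClass or sameAs).
EQUIVALENT = {("castle", "chateau")}


def exact(a, b):
    """Same URI or equivalent concepts."""
    return a == b or (a, b) in EQUIVALENT or (b, a) in EQUIVALENT


def plugin(a, b):
    """B is a hypernym of A or a concept relevant to A."""
    ctx = CONTEXTS.get(a, {})
    return b in ctx.get("he", set()) or b in ctx.get("r", set())


def subsume(a, b):
    """B is a hyponym of A."""
    return b in CONTEXTS.get(a, {}).get("ho", set())


FILTERS = {"e": exact, "p": plugin, "su": subsume}


def matches(a, b, filter_set):
    """A matches B w.r.t. F iff at least one filter f in F accepts the pair."""
    return any(FILTERS[f](a, b) for f in filter_set)


print(matches("castle", "fortification", {"e"}))       # False
print(matches("castle", "fortification", {"e", "p"}))  # True
print(matches("castle", "keep", {"e", "p", "su"}))     # True
```

Adding filters to the set can only turn a non-match into a match, never the reverse, which is consistent with relaxed filter sets trading precision for recall.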


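Definition 2. The AH similarity of two atoms X and Y is the normalized value
to [0..1] that is defined, with respect to a function S and a filter set F, as

$$AH(X, Y, F) = \begin{cases} S(X, Y) & \text{if } X \text{ matches } Y \text{ with respect to } F \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$

We generalize (10) on two sets $S_A$, $S_B$ of atoms as

$$AH_{set}(S_A, S_B, F) = \frac{\sum_{\forall B \in S_B} \max_{\forall A \in S_A} [AH(B, A, F)]}{|S_B|} \qquad (11)$$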


Intuitively, for each atom $B \in S_B$ there should be at least one atom $A \in S_A$
relevant to B, with respect to the filter set F. Otherwise, $AH_{set}$ returns
0 (absolute mismatch). The overall $AH_{set}$ similarity is computed as the mean
value of the sum of the maximum AHs for each atom B, since each B may have
more than one relevant atom in $S_A$. In V4Ann, $S_A$ represents the atoms that
are associated with an asset, while $S_B$ is the set of keywords.
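
The set-level score described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the V4Ann code: the similarity function and the filter test are passed in as parameters, the toy stand-ins toy_sim and toy_matches use made-up scores, and the rule that a single unmatched keyword yields 0 follows the description in the preceding paragraph.

```python
def ah(x, y, filters, sim, matches):
    """AH(X, Y, F): S(X, Y) if X matches Y under the filter set, else 0."""
    return sim(x, y) if matches(x, y, filters) else 0.0


def ah_set(asset_atoms, keywords, filters, sim, matches):
    """Mean over keywords B of the best AH(B, A, F) among the asset atoms A.

    Returns 0 if any keyword has no relevant asset atom (absolute mismatch).
    """
    if not keywords:
        return 0.0
    best = [max((ah(b, a, filters, sim, matches) for a in asset_atoms), default=0.0)
            for b in keywords]
    if any(score == 0.0 for score in best):
        return 0.0
    return sum(best) / len(keywords)


# Toy stand-ins for S and for the filter relation; the scores are illustrative.
SCORES = {("castle", "castle"): 1.0, ("lake", "body of water"): 0.7}


def toy_sim(x, y):
    return SCORES.get((x, y), 0.0)


def toy_matches(x, y, filters):
    return (x, y) in SCORES


asset = {"castle", "body of water"}
score_hit = ah_set(asset, ["castle", "lake"], {"e", "p"}, toy_sim, toy_matches)
score_miss = ah_set(asset, ["castle", "bridge"], {"e", "p"}, toy_sim, toy_matches)
print(score_hit)   # 0.85
print(score_miss)  # 0.0
```

For the query "castle lake" the asset scores (1.0 + 0.7) / 2 = 0.85, whereas for "castle bridge" the keyword "bridge" has no relevant atom and the asset is rejected.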
6.2 Atom Similarity S


As a similarity function S(A, B), V4Ann uses a heuristic function that takes into 
account the information captured in local contexts of A and B, i.e. in the sets 
r, he and ho (see Definition 1). The implementation of S is summarised in the 
following priority rules $r_i$, where $r_1 > r_2 > r_3 > r_4$.

$r_1$: if $A = B \lor A \equiv B$, then $S(A, B) = 1$.

$r_2$: if $B \in he_A \lor B \in r_A$, then $S(A, B) = a$.

$r_3$: if $B \in ho_A$, then $S(A, B) = b$.

$r_4$: $S(A, B) = 0$.


Currently, a and b (a > b) are defined manually based on domain knowledge
regarding the quality of multimedia analysis that produces the atoms (e.g. aesthetics
extraction). The empirical definition of these values (currently a = 0.7
and b = 0.3) aims to promote plugin matches ($r_2$) over subsumed ones ($r_3$).
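
Read as code, the priority rules amount to an if-chain evaluated from the highest to the lowest priority. The sketch below assumes that local contexts are given as dictionaries with the keys "r", "he" and "ho"; the function name, the parameter names and the toy context are hypothetical, while the weights 0.7 and 0.3 are the values of a and b reported above.

```python
A_WEIGHT = 0.7  # a: score of a plugin match (hypernym or relevant concept)
B_WEIGHT = 0.3  # b: score of a subsume match (hyponym)


def similarity(a, b, contexts, equivalent=frozenset()):
    """Heuristic S(A, B) following the priority order of the four rules."""
    ctx = contexts.get(a, {})
    if a == b or (a, b) in equivalent or (b, a) in equivalent:  # rule 1
        return 1.0
    if b in ctx.get("he", ()) or b in ctx.get("r", ()):         # rule 2
        return A_WEIGHT
    if b in ctx.get("ho", ()):                                  # rule 3
        return B_WEIGHT
    return 0.0                                                  # rule 4


# Toy local context, for illustration only.
contexts = {"castle": {"r": {"moat"}, "he": {"fortification"}, "ho": {"keep"}}}

print(similarity("castle", "castle", contexts))         # 1.0
print(similarity("castle", "fortification", contexts))  # 0.7
print(similarity("castle", "keep", contexts))           # 0.3
print(similarity("castle", "bridge", contexts))         # 0.0
```

Because the rules are checked in priority order, an atom that appears both as a relevant concept and as a hyponym is scored with a rather than b.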


7 Evaluation and Discussion 


7.1 Digital Content 


Deutsche Welle (DW) and Europeana are two key content providers in V4Design. 
DW provides selected parts of their documentary and movie archives so as to 
localise building structures and objects. Europeana provides their large archive 
of paintings, pictures of contemporary artwork and related reviews, for stylistic
and aesthetics extraction and textual analysis. The generated V4Ann annota- 
tion graphs contain the atoms that have been extracted from the analysis com- 
ponents, along with interconnections among the annotation resources. Table 1 
provides some statistics for the annotation graphs. 


7.2 Evaluation 


User-Centred. A user-centred evaluation has been performed with a twofold 
purpose. First, to collect qualitative feedback on the results, as well as on non- 
functional aspects, such as query response time. Second, and most important, 
to generate an annotation dataset and assess the performance of V4Ann. 

Participants were invited to evaluate the current implementation by performing
keyword-based queries. A list of relevant resources was provided, such as
square names, monuments, building types, etc., in order to help them conduct
relevant queries. Users filled in a five-point scale questionnaire (1: completely
agree, 5: completely disagree). Sample questions are depicted in Table 2. The
feedback can be summarised as follows:


Table 1. The number of annotations and atoms in the V4Ann annotation graphs,
along with the average size of local context for each atom (r + he + ho).


#annotations | #atoms | Avg. local context size
17245 | 154610 | 17 per atom

Table 2. Example questions answered by users.


# | Question | Mean ± SD
Q1 | Atoms that are derived from visual analysis are most of the time correct | 1.7 ± 0.83
Q2 | Atoms that are derived from text analysis are most of the time correct | 2.34 ± 0.79
Q3 | Many times irrelevant results are top-ranked | 3.97 ± 1.29
Q4 | There are many irrelevant results | 2.43 ± 1.41
Q5 | It takes too long for the system to provide a response | 4.04 ± 0.99
Q6 | There are too many “No results” responses | 4.08 ± 0.45


— Quality of atoms: The quality and relevance of local contexts depend on
the performance of content analysis, e.g. visual and textual analysis. Table 2
shows that visual analysis provides, in principle, better results than text analysis
(Q1, Q2).

— Retrieval results: According to Q3, the system achieves good top-ranked
accuracy; however, the complete set of results contains quite a lot of irrelevant
entries (Q4). As we explain in the next section, this is mainly related to the
context provided in the query (i.e. number of keywords). Due to the local
context, the system was able to provide a response in most cases (Q6), even
if only partially correct (Q4).

— Response time: The response time of the system was positively assessed
(Q5). The average response time was 4.1 seconds, which includes query analysis,
building of the local context and execution of the searching algorithm.


System Evaluation. We manually annotated the relevance sets of the performed
queries, so as to quantitatively assess performance. Table 3 depicts the
average precision and recall achieved for h = 1 and h = 3 and using different
searching filters (Sect. 6.1). As expected, the stricter the filter is, the more accurate
the results we obtain (high precision), although with low recall. On the other
hand, the more relaxed the filter is, the higher the recall, with a negative
impact on precision. This is due to the fact that with a strict filter (i.e.
exact), the probability of finding the correct annotation is higher compared to a
relaxed filter (i.e. subsume), since in the second case, partial matches are also
allowed.
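
For readers who want to reproduce this kind of measurement, the sketch below shows one common way to compute precision and recall over the top-k results of a query and to average them over queries. The exact protocol used in the evaluation is not spelled out in the text, so the cut-off handling, the macro-averaging and all names here are assumptions.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Precision and recall of the top-k ranked results against a relevance set."""
    top = ranked[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def macro_average(per_query):
    """Average the (precision, recall) pairs over all queries."""
    n = len(per_query)
    return (sum(p for p, _ in per_query) / n,
            sum(r for _, r in per_query) / n)


# Illustrative query: four returned assets, three relevant assets in total.
p, r = precision_recall_at_k(["a1", "a2", "a3", "a4"], {"a1", "a3", "a9"}, k=4)
print(p, r)
avg_p, avg_r = macro_average([(p, r), (1.0, 1.0)])
```

In this toy query two of the four returned assets are relevant (precision 0.5) and two of the three relevant assets are found (recall 2/3).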

It should be noted that the overall performance of V4Ann strongly depends
on the quality of the atoms, which in turn depends on the quality of the results
provided to V4Ann. For example, if the wrong style for a painting is provided by
aesthetics extraction, this will affect precision, since V4Ann does not aim at improving the
classification of incoming atoms. However, we plan to integrate multimodal data
aggregation and fusion techniques to derive the most plausible classification of
atoms and help improve the contextual information captured in local contexts.

Another interesting finding involves the threshold h. We observed that for
h = 1 the framework provides better results than for h = 3, i.e. than when enriching
the local context with additional atoms, up to the third level. Intuitively, h
controls the amount of contextual information taken into account during
the definition of local contexts. A higher h value leads to more generic local
contexts that affect precision. For example, the third-level WordNet hypernym
of “tower” is “unit”, which is too generic, obfuscating the semantics of the atom.
The optimal value of h depends on the concreteness of the atoms extracted
from content analysis: the more specific the label/atom is, the more room there is for
additional context. In our experiments, the labels we get tend to be generic;
therefore, the best performance is achieved with h = 1.


Table 3. Average precision and recall (top-20 results).


Filter | Recall (h = 1) | Precision (h = 1) | Recall (h = 3) | Precision (h = 3)
exact | 0.59 | 0.77 | 0.44 | 0.51
plugin | 0.67 | 0.69 | 0.52 | 0.48
subsume | 0.73 | 0.61 | 0.59 | 0.42


8 Conclusion 


In this paper we presented V4Ann, an ontology-based framework for represent- 
ing, linking and enriching results of multimedia analysis on digital content. 
V4Ann generates annotation graphs of image, video, textual analysis and 3D 
model reconstruction, so as to facilitate the systematic process, integration and 
organisation of information and establish practical repurposing mechanisms. 

The annotation model of V4Ann reuses existing standards and schemata,
building the atom-based annotation graphs on top of standard ontologies, controlled
vocabularies and patterns. The vocabularies are defined in OWL 2 and
atoms are associated with assets using the WADM pattern. As such, it promotes
interoperability and fosters the use of declarative languages to identify further
inferences and ensure the semantic consistency of the knowledge graphs. We
also elaborated on the concept of local contexts, as well as on the AH metric 
for asset retrieval. We evaluated the framework using actual multimedia content 
and atoms provided by the V4Design modules and discussed the findings. 

V4Ann is accessible through Rhinoceros 3D (Rhino)¹² and Unity plugins
developed in the V4Design project through which users (architects and video
game designers) can search for assets and import them in the scene. For future
work we plan to implement context-aware algorithms to improve the classifica- 
tion accuracy of incoming atoms, as well as to extend the context-aware retrieval 
algorithm with more sophisticated similarity metrics and functions. 


12 https://gitlab.com/v4designEU/v4d4rhino.

Abstract. RSP-QL was developed by the W3C RDF Stream Processing
(RSP) community group as a common way to express and query
RDF streams. However, RSP-QL does not provide any way of annotat- 
ing data on the statement level, for example, to express the uncertainty 
that is often associated with streaming information. Instead, the only 
way to provide such information has been to use RDF reification, which 
adds additional complexity to query processing, and is syntactically ver- 
bose. In this paper, we define an extension of RSP-QL, called RSP-QL*, 
that provides an intuitive way for supporting statement-level annota- 
tions in RSP. The approach leverages the concepts previously described 
for RDF* and SPARQL*. We illustrate the proposed approach based on 
a scenario from a research project in e-health. An open-source implemen- 
tation of the proposal is provided and compared to the baseline approach 
of using RDF reification. The results show that this way of dealing with 
statement-level annotations offers advantages with respect to both data 
transfer bandwidth and query execution performance. 


Keywords: RSP-QL* · RDF* · RDF Stream Processing · e-health


1 Introduction 


Recent years have seen an increasing interest in processing and analyzing stream- 
ing information as it is generated by applications, services, sensors, and smart 
devices. RDF Stream Processing (RSP) leverages the principles of Linked Data 
and the Semantic Web to cope with heterogeneity in data, but employs strate- 
gies inspired from stream processing to cope with high velocity data streams. 
During the last decade, several RSP systems and models have been proposed, 
which have all provided their own syntax, semantics, and underlying assumptions
about the nature of RDF streams [6,7]. The RSP community group¹ was
formed to define a common model for producing, transmitting and continuously 
querying RDF streams. The first version of this common query model (RSP-QL) 


1 https://www.w3.org/community/rsp/.

was proposed by Dell’Aglio et al. in 2014 [7], and the draft of the abstract syntax
and semantics was published by the RSP community group in 2016 [2]. 

Data generated by sensors is almost always coupled with provenance infor- 
mation, or a level of uncertainty representing, for instance, lack of precision 
or a knowledge gap. For example, all values reported by a temperature sensor 
may be associated with some error describing a probability distribution. The 
RDF specification provides a vocabulary that allows metadata to be represented 
about RDF triples using RDF reification [11]. In practice, however, this is not 
widely adopted as a standard for representing and managing such metadata on 
the Semantic Web [8]. RDF* was recently proposed as a way to support a concise 
representation of statement-level metadata, while remaining backwards compat- 
ible with standard RDF [9,10]. By enclosing a triple using the strings ‘<<’ and 
‘>>’, the extension allows it to be used in the subject or object position of other 
triples. This allows statement-level metadata to be provided directly. For exam- 
ple, the triple :bob :knows :alice could be annotated with the source wikipedia as 
follows: <<:bob :knows :alice>> :source :wikipedia. Similarly, the authors propose
SPARQL* as an extension of SPARQL for querying RDF* data, where SPARQL*
supports similar nesting of triple patterns.

We propose an extension to RSP-QL that leverages RDF*/SPARQL* for annotating
and querying streaming data. We show that the proposed approach has
several benefits over RDF reification when it comes to statement-level annota- 
tions. The approach is motivated based on a use case from a current research 
project, where we attempt to detect abnormal situations in an e-health scenario. 

The rest of the paper is organized as follows. Section 2 briefly discusses the 
relevant related work, while Sect. 3 describes a use-case scenario that both moti- 
vates the proposed approach and exemplifies the requirements addressed by the 
proposal. Section 4 describes the proposed approach informally, and Sects. 5 and 
6 provide the necessary formal definitions, where Sect. 5 defines the data model 
and Sect. 6 defines the syntax and semantics of the proposed RSP-QL extension. 
Section 7 provides an application-based evaluation of the approach. Section 8 
describes a prototype implementation and a performance evaluation of the imple- 
mented system. Section 9 discusses the impact of the presented work and Sect. 10 
summarizes the main conclusions of the paper. 


2 Related Work 


Over the past decade, there has been a growing interest in providing models 
and languages for combining the principles of the Semantic Web with streaming 
information. RDF Stream Processing (RSP) systems aim to provide extensions 
to RDF and SPARQL for representing and querying streaming data. However, 
though several RSP systems have emerged that provide extensions and oper- 
ators for this purpose [1,3,4,13,18], they typically provide different languages, 
constructs, operators, and evaluation semantics [7]. The W3C RSP community
group was formed to define a common model for representing and querying 
streaming RDF data. The proposed model and language, RSP-QL [7], can be
used to model the behavior of most of the current RSP systems, and provides
well-defined semantics for explaining query execution. However, none of the exist- 
ing RSP approaches have given much attention to aspects related to representing 
metadata in streams, such as uncertainty or provenance. The RSP-QL stream 
model allows such annotations to be provided on the graph level, but annotations 
on the triple level are not supported. 

The term statement-level metadata refers to data that captures information 
about a single statement or fact. The RDF specification includes the notion 
of RDF reification that lets a set of RDF triples describe some other RDF 
triple [11]. The approach requires the inclusion of four additional RDF triples 
for every statement where metadata is to be provided. Another approach is to 
leverage named graphs, where the identifier of the graphs can be used to attach 
metadata to statements [12]. However, this approach has the disadvantage of 
inhibiting the application of named graphs for other uses. Finally, singleton prop- 
erties have been proposed as an alternative approach, where a distinct property 
is provided for each triple to be annotated [15]. The singleton properties pro- 
posal introduces a large number of unique predicates, which is atypical for RDF 
data, and disadvantageous for common SPARQL optimization techniques [19]. 
Additionally, these approaches result in verbose queries [9]. For standard RDF, 
there therefore exists no convenient way of annotating data with metadata on 
the statement level [10]. The RDF*/SPARQL* approach was proposed as a way
of supporting a more intuitive representation, by allowing triples in the subject 
and object positions of RDF statements [9,10]. In this paper, we propose to 
extend RSP-QL based on this approach. 
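
To make the difference in verbosity concrete, the following Python sketch models triples as plain tuples and contrasts a reified annotation with an RDF*-style nested triple. It is only an illustration under simplifying assumptions: the helper names (`reify`, `annotate_reified`, `annotate_star`) and the `ex:`/`_:s1` terms are hypothetical and do not come from any RDF library.

```python
RDF = "rdf:"

def reify(triple, stmt_id):
    """Return the four standard reification triples that describe `triple`."""
    s, p, o = triple
    return [
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", s),
        (stmt_id, RDF + "predicate", p),
        (stmt_id, RDF + "object", o),
    ]

def annotate_reified(triple, stmt_id, meta_p, meta_o):
    """Attach metadata to the statement resource created by reification."""
    return reify(triple, stmt_id) + [(stmt_id, meta_p, meta_o)]

def annotate_star(triple, meta_p, meta_o):
    """Attach metadata by using the triple itself as the subject (RDF* style)."""
    return [(triple, meta_p, meta_o)]

obs = ("ex:obs1", "sosa:hasSimpleResult", 72)
reified = annotate_reified(obs, "_:s1", "ex:confidence", 0.97)
starred = annotate_star(obs, "ex:confidence", 0.97)
print(len(reified), len(starred))
```

In this toy encoding, one annotation costs five tuples with reification (four to describe the statement, one for the metadata) and a single nested tuple with the RDF*-style representation.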


3 Use-Case Scenario 


In this section, we describe a use-case scenario to exemplify the kinds of requirements
that may be addressed by combining RSP-QL with RDF*/SPARQL*. The
scenario originates from an ongoing research project, E-care@home², in which
the aim is to develop privacy-preserving AI solutions for home care of elderly
patients. In addition to developing technical solutions, the project has put great 
emphasis on studying the requirements of stakeholders. These requirements have 
been documented in a project deliverable [14]. As part of this deliverable, a 
number of personas and use-case scenarios were also developed, including the 
following description of a scenario involving the patient Rut who has advanced 
chronic obstructive pulmonary disease (COPD) and is multimorbid. 

“The system can automatically sense abnormal situations, e.g. when certain 
health parameters deviate from the normal values, or when the overall situation 
as assessed by a multitude of sensors appears abnormal. When the system detects 
such situations, it sends out an alarm to a suitable recipient based on the severity 
of the deviation (e.g., emergency dispatch for a life-threatening deviation, the 
patient’s physician if no immediate action is required, or next-of-kin if suitable). 
[...] Today the system has detected an abnormal state. Rut appears to have been 


2 http://ecareathome.se/.

sitting in the same position in a chair in the living room for an unusually long
time given that there are no entertainment devices turned on at the moment. 
Her heart rate is above normal, but her breathing is slower than normal. Small 
motions indicate that she is not asleep, yet she is not moving much. Her oxygen 
levels are about normal. The system decides to classify this as a low-emergency 
abnormal state. The system also knows that Rut’s partner has left the house a 
few hours ago. It therefore sends an alert to him [...] the alert reaches Rut’s 
partner as he is already on his way home. He hurries home and opens the door 
only to find out that Rut is in good health and has been enjoying a paperback copy 
of the latest crime novel by a famous Swedish author for the past few hours” [14]. 

Like any health-care system, the one envisioned by E-care@home sets high 
requirements in terms of patient safety, system reliability, and transparency. To 
this end, all the data that the system uses to draw conclusions and to generate 
suggestions, or even to take action, must be accompanied by some assessed 
confidence. For instance, in the scenario above, to put patient safety first the 
system cannot afford to miss an abnormal and highly dangerous situation, but on 
the other hand it needs to be able to disregard observations that are not reliable. 
As an example, whenever a pulse oximeter reports the oxygen saturation of a
patient, the system also needs to know the confidence that the system can put 
in this value. The sensor may have a fixed confidence value, but the system may 
also derive an adjusted value that takes into account contextual factors of the 
measurement, such as the position of the sensor and the activity of the patient 
at measurement time. Regardless of how the confidence value is derived, it needs 
to be reported as part of the reported observation. 


4 Overview of RSP-QL* 


The main difference between RSP and traditional RDF/SPARQL processing is 
that the former introduces a time dimension to processing [6]. The time dimen- 
sion in RSP-QL is managed by allowing windows to define discrete subsets over 
RDF streams, and at any point in time, a window can be queried as a regu- 
lar RDF dataset. The approach proposed in this paper extends RSP-QL in two 
fundamental ways: RDF streams are extended to support RDF*, and the supported
graph patterns in RSP-QL are extended to support those in SPARQL*.
The example in Listing 1.1 shows an RSP-QL* query that illustrates the main 
features and language constructs. 

The registered query is evaluated every 10 s. It defines a time-based
window with a width of 1 min that slides every 10 s over the heart-rate stream.
The query then matches the heart-rate value and confidence of each observation 
in the window using an RDF* pattern [9]. This is the only difference between 
RSP-QL and RSP-QL* in this query. The results are then filtered based on a 
threshold, and the heart-rate value and timestamp of the matched observations 
are reported. There are conceptually no limitations on the complexity of the
provided annotations; for example, they could be represented as confidence
intervals or distributions rather than as single values.

PREFIX ex: <http://www.example.org/ontology#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
REGISTER STREAM <heart-rate/alert> COMPUTED EVERY PT10S AS
SELECT ?hr ?time
FROM NAMED WINDOW <window/1> ON <http://stream/heart-rate> [RANGE PT1M STEP PT10S]
WHERE {
  WINDOW <window/1> {
    GRAPH ?g {
      <<?obs sosa:hasSimpleResult ?hr>> ex:Confidence ?confidence .
      FILTER(?confidence > 0.9 && ?hr > 120)
    }
    ?g <generatedAt> ?time .
  }
}


Listing 1.1. Example of an RSP-QL* query. 
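
The behaviour of the query in Listing 1.1 can be approximated in a few lines of Python. The sketch below is an illustration only, not the paper's engine: observations are flattened into a hypothetical `Observation` record, the window is taken to be the half-open interval of the last 60 s, and the thresholds are those of the FILTER clause.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    time: float        # timestamp of the enclosing named graph, in seconds
    heart_rate: float  # value of sosa:hasSimpleResult
    confidence: float  # statement-level annotation on the result triple

def heart_rate_alerts(stream, now, width=60.0, min_conf=0.9, min_hr=120.0):
    """Evaluate the query once at time `now` over the window [now - width, now)."""
    return [
        (o.heart_rate, o.time)
        for o in stream
        if now - width <= o.time < now
        and o.confidence > min_conf
        and o.heart_rate > min_hr
    ]

stream = [
    Observation(5, 130, 0.95),   # reported while inside the window: high rate, confident
    Observation(30, 125, 0.80),  # dropped: confidence too low
    Observation(50, 110, 0.99),  # dropped: heart rate below threshold
    Observation(65, 140, 0.92),
]
for now in (60, 70):  # two consecutive evaluations, 10 s apart
    print(now, heart_rate_alerts(stream, now))
```

Each call corresponds to one evaluation of the registered query; the observation at time 5 is reported at `now = 60` and has slid out of the window by `now = 70`.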


5 Data Model 


This section defines the concepts that capture the notion of streams considered 
by our approach. We begin with the basic notions of RDF and RDF*. 

As usual [5,16], we assume three pairwise disjoint, countably infinite sets
$\mathcal{I}$ (IRIs), $\mathcal{B}$ (blank nodes), and $\mathcal{L}$ (literals). Then, an RDF triple is a tuple
$(s,p,o) \in (\mathcal{I} \cup \mathcal{B}) \times \mathcal{I} \times (\mathcal{I} \cup \mathcal{B} \cup \mathcal{L})$, and an RDF graph is a set of RDF triples.
For such a triple $(s,p,o)$, $s$ is called the subject, $p$ the predicate, and $o$ the object.

RDF* extends this notion of triples by allowing the subject or the object to be 
another triple [9]. This form of nesting of triples, which may be arbitrarily deep, 
allows for statements to capture metadata about other statements. Formally, an 
RDF* triple is defined recursively as follows [9]: (i) any RDF triple is an RDF*
triple, and (ii) given two RDF* triples $t$ and $t'$, and the RDF terms $s \in (\mathcal{I} \cup \mathcal{B})$,
$p \in \mathcal{I}$, and $o \in (\mathcal{I} \cup \mathcal{B} \cup \mathcal{L})$, the tuples $(t,p,o)$, $(s,p,t)$, and $(t,p,t')$ are RDF*
triples. Furthermore, a set of RDF* triples is called an RDF* graph.
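
The recursive definition translates directly into a recursive membership test. The following Python sketch is a minimal illustration; the classes `IRI`, `BlankNode`, and `Literal` and the functions `is_rdf_star_triple` and `depth` are hypothetical names introduced here, and triples are modelled as 3-tuples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    value: object

def is_rdf_star_triple(t):
    """Check the recursive definition: nested triples may appear as subject or object."""
    if not (isinstance(t, tuple) and len(t) == 3):
        return False
    s, p, o = t
    subject_ok = isinstance(s, (IRI, BlankNode)) or is_rdf_star_triple(s)
    predicate_ok = isinstance(p, IRI)
    object_ok = isinstance(o, (IRI, BlankNode, Literal)) or is_rdf_star_triple(o)
    return subject_ok and predicate_ok and object_ok

def depth(t):
    """Nesting depth of an RDF* triple; 0 for an ordinary RDF triple."""
    s, _, o = t
    return max([1 + depth(x) for x in (s, o) if isinstance(x, tuple)], default=0)

base = (IRI("ex:bob"), IRI("ex:knows"), IRI("ex:alice"))
nested = (base, IRI("ex:source"), IRI("ex:wikipedia"))
print(is_rdf_star_triple(nested), depth(nested))
```

Case (i) of the definition corresponds to a tuple with no nested component, and case (ii) to the recursive calls on the subject and object positions.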

The concept of an RDF dataset has been introduced to represent collections 
of RDF graphs [5]. We extend this concept to cover RDF* graphs. 


Definition 1. A named RDF* graph is a pair $(n, G^*)$ where $n \in (\mathcal{I} \cup \mathcal{B})$,
which is called the graph name, and $G^*$ is an RDF* graph. An RDF* dataset
is a set $D = \{G_0^*, (n_1, G_1^*), (n_2, G_2^*), \ldots, (n_i, G_i^*)\}$, where $G_0^*$ is an RDF* graph,
called the default graph of $D$, and $(n_k, G_k^*)$ is a named RDF* graph for all
$k \in \{1, 2, \ldots, i\}$.


While the RDF model is atemporal, the notion of an RDF stream has been 
introduced to capture the dynamic nature of streaming RDF data [7]. Along the 
same lines, we define an RDF* stream as a time-ordered sequence of elements 
that are captured by a specific form of RDF* datasets. 


Definition 2. Let $p$ be an IRI that denotes a predicate to capture timestamps for
named RDF* graphs. Then, an RDF* stream element $E$ is an RDF* dataset
that consists of a default graph $G_0^*$ and exactly one named RDF* graph $(n, G^*)$
such that the default graph $G_0^*$ contains one RDF triple of the form $(n, p, \tau)$,
where $\tau$ is a timestamp. To denote this timestamp $\tau$ in $E$ we write $\tau(E)$.

Definition 3. An RDF* stream $S$ is a potentially unbounded sequence of
RDF* stream elements such that for every pair of such elements $E_i$ and $E_j$,
where $E_i$ comes before $E_j$ (i.e., $S = (\ldots, E_i, \ldots, E_j, \ldots)$), the following properties
hold:


1. $\tau(E_i) \leq \tau(E_j)$, and
2. the names of the single named RDF* graph $(n_i, G_i^*)$ in $E_i$ and of the single
named RDF* graph $(n_j, G_j^*)$ in $E_j$ are different (i.e., $n_i \neq n_j$).


A named RDF* stream is a pair $(n, S)$ where $n \in \mathcal{I}$ and $S$ is an RDF* stream.
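
As a rough illustration of Definitions 2 and 3, the sketch below represents a stream element by its graph name, its named graph, and its timestamp, and checks the two stream properties on a finite prefix of a stream. The names `StreamElement` and `is_valid_stream` are hypothetical, and the check assumes that equal timestamps on consecutive elements are permitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamElement:
    name: str          # graph name n of the single named RDF* graph
    graph: frozenset   # the named RDF* graph G* (a set of triples)
    timestamp: float   # tau(E), stated in the default graph as (n, p, tau)

def is_valid_stream(elements):
    """Check both properties of Definition 3 on a finite sequence of elements."""
    seen_names = set()
    previous = None
    for e in elements:
        # Property 1: timestamps never decrease along the sequence.
        if previous is not None and e.timestamp < previous:
            return False
        # Property 2: every element carries a distinct graph name.
        if e.name in seen_names:
            return False
        seen_names.add(e.name)
        previous = e.timestamp
    return True

ok = [StreamElement("g1", frozenset(), 1), StreamElement("g2", frozenset(), 1),
      StreamElement("g3", frozenset(), 5)]
out_of_order = [StreamElement("g1", frozenset(), 5), StreamElement("g2", frozenset(), 1)]
duplicate = [StreamElement("g1", frozenset(), 1), StreamElement("g1", frozenset(), 2)]
print(is_valid_stream(ok), is_valid_stream(out_of_order), is_valid_stream(duplicate))
```

Checking adjacent elements is sufficient for property 1, because the ordering of timestamps is transitive.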


We also need to define a notion of windows over such streams as a way of 
referencing discrete portions of potentially infinite data streams [7]. 


Definition 4. A window $W$ over an RDF* stream $S$ is a finite set of RDF*
stream elements from $S$.


In this paper, we focus explicitly on temporal window operators (other win- 
dow operators, such as count-based windows, can be defined in a similar manner). 
To this end, we define a time-based window of an RDF* stream as a contiguous 
set of elements from the stream whose timestamp is in a given interval. 


Definition 5. Given a time interval $[l,u)$, the time-based window over an
RDF* stream $S$ for $[l,u)$, denoted by $\mathcal{W}(S,l,u)$, is a window over $S$ that is defined
as follows: $\mathcal{W}(S,l,u) = \{E \mid E \text{ is in } S \text{ and } l \leq \tau(E) < u\}$.
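
A time-based window over a half-open interval is a one-line filter. In the sketch below, which is only an illustration, a stream element is reduced to a hypothetical `(timestamp, graph_name)` pair and `time_window` is a name introduced here.

```python
def time_window(stream, lower, upper):
    """Return the elements whose timestamp lies in the half-open interval [lower, upper)."""
    return [element for element in stream if lower <= element[0] < upper]

stream = [(0, "g0"), (10, "g1"), (20, "g2"), (30, "g3")]
print(time_window(stream, 10, 30))
```

The lower bound is inclusive and the upper bound exclusive, so with adjacent intervals every element falls into exactly one of them.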


Finally, we shall need a function that represents any window as an RDF* 
dataset. Informally, this dataset consists of all the named RDF* graphs of all 
RDF* stream elements within the window, and the default graph of this dataset 
is constructed from the default graphs in all these RDF* stream elements. 


Definition 6. Let $W = \{E_1, E_2, \ldots, E_n\}$ be a window over some RDF* stream.
The dataset representation of $W$, denoted by $DS(W)$, is the RDF* dataset
that is constructed as follows:


- the default graph of $DS(W)$ is $G_0^* = \bigcup_{\{G^*_{\mathrm{dflt}}, (n, G^*)\} \in W} G^*_{\mathrm{dflt}}$, and
- the set of named RDF* graphs in $DS(W)$ is $\{(n, G^*) \mid \{G^*_{\mathrm{dflt}}, (n, G^*)\} \in W\}$.
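
The construction in Definition 6 amounts to a union of default graphs plus a collection of the named graphs. The following Python sketch illustrates it under simplifying assumptions: each stream element is a hypothetical pair `(default_graph, (name, named_graph))`, graphs are Python sets of tuples, and `ex:generatedAt` stands in for the timestamp predicate.

```python
def dataset_representation(window):
    """Build (default graph, named graphs) for a window of stream elements."""
    default_graph = set()
    named_graphs = {}
    for element_default, (name, graph) in window:
        default_graph |= element_default   # union of all element default graphs
        named_graphs[name] = graph         # graph names are unique within a stream
    return default_graph, named_graphs

e1 = ({("g1", "ex:generatedAt", 10)}, ("g1", {("ex:o1", "sosa:hasSimpleResult", 130)}))
e2 = ({("g2", "ex:generatedAt", 20)}, ("g2", {("ex:o2", "sosa:hasSimpleResult", 95)}))
default_graph, named_graphs = dataset_representation([e1, e2])
print(sorted(default_graph), sorted(named_graphs))
```

The resulting default graph holds the timestamp triples of all elements in the window, which is what allows a query to relate a matched named graph to its timestamp.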


6 Syntax and Semantics of RSP-QL* 


This section defines RSP-QL*, which is an RDF*-aware extension of RSP-QL.
RSP-QL, in turn, is an extension of SPARQL. Hence, our definitions in this 
section extend RSP-QL [7] along the lines of how SPARQL* extends SPARQL [9, 
10], and by also taking into account the abstract syntax and semantics draft of 
the W3C RSP community group [2]. For the SPARQL-specific constructs we 
adopt the algebraic SPARQL syntax introduced by Pérez et al. [16]. Due to 
space constraints, we limit ourselves to presenting only the core concepts of the 
language.


6.1 Syntax of RSP-QL* Queries


RSP-QL is an extension of SPARQL [17], and the basic building block is a basic
graph pattern (BGP), that is, a finite set of triple patterns. A triple pattern is a
tuple $(s,p,o) \in (\mathcal{V} \cup \mathcal{B} \cup \mathcal{I}) \times (\mathcal{V} \cup \mathcal{I}) \times (\mathcal{V} \cup \mathcal{B} \cup \mathcal{I} \cup \mathcal{L})$, where $\mathcal{V}$ is a countably
infinite set of query variables that is disjoint from $\mathcal{B}$, $\mathcal{I}$, and $\mathcal{L}$, respectively.
Like SPARQL* [9,10], RSP-QL* extends these notions further by supporting
the concept of triple* patterns, which add the possibility to nest triple patterns
(arbitrarily deep), and which are defined recursively as follows [9,10]:


- any triple pattern is a triple* pattern, and
- given two triple* patterns $tp$ and $tp'$, and $s \in (\mathcal{I} \cup \mathcal{B} \cup \mathcal{V})$, $p \in (\mathcal{I} \cup \mathcal{V})$, and
$o \in (\mathcal{I} \cup \mathcal{B} \cup \mathcal{L} \cup \mathcal{V})$, then $(tp, p, o)$, $(s, p, tp)$, and $(tp, p, tp')$ are triple* patterns.


A finite set of triple* patterns is referred to as a BGP*. 

On top of BGPs, RSP-QL supports all the other forms of graph patterns that 
have been introduced for SPARQL, and RSP-QL adds a new form to match data 
within windows of streaming data. We define a corresponding notion of patterns 
for RSP-QL*, but for brevity we here focus only on the core constructs. 


Definition 7. An RSP-QL* pattern is defined recursively as follows: 


1. Any BGP* is an RSP-QL* pattern. 

2. If $n \in (\mathcal{V} \cup \mathcal{I})$ and $P$ is an RSP-QL* pattern, then (WINDOW $n$ $P$) and
(GRAPH $n$ $P$) are RSP-QL* patterns.

3. If $P_1$ and $P_2$ are RSP-QL* patterns, then ($P_1$ AND $P_2$), ($P_1$ OPT $P_2$), and
($P_1$ UNION $P_2$) are RSP-QL* patterns.


In addition to such patterns, every RSP-QL* query may declare windows over 
named RDF* streams, which we capture by the concept of window declarations. 


Definition 8. A window declaration is a tuple $(u_S, \alpha, \beta, \tau_0)$ where $u_S \in \mathcal{I}$
is an IRI (representing the name of a named RDF* stream), $\alpha$ is a time duration
(representing a window width), $\beta$ is a time duration (representing a slide
parameter), and $\tau_0$ is a timestamp (representing a start time).


We now have everything required to define RSP-QL* queries, which consist 
of an RSP-QL* pattern and window declarations that are associated with IRIs
to serve as names for the corresponding windows in the query. 


Definition 9. An RSP-QL* query is a pair $(\omega, P)$ where $\omega$ is a partial function
that maps some IRIs in $\mathcal{I}$ to a window declaration, respectively, and $P$ is
an RSP-QL* pattern such that for every sub-pattern (WINDOW $n$ $P'$) in $P$ it
holds that if $n \in \mathcal{I}$, then $\omega$ is defined for $n$, i.e., $n \in \mathrm{dom}(\omega)$.


6.2 Semantics of RSP-QL* Queries


We now define the semantics of RSP-QL* queries, for which we have to introduce
some concepts used to define the query semantics of SPARQL and of SPARQL*.
The query semantics of SPARQL is based on the notion of solution mappings [16]
that map query variables to blank nodes, IRIs, or literals. For SPARQL*, this
notion has been extended to also be able to map to RDF* triples. That is, a
solution* mapping is a partial function $\eta : \mathcal{V} \to (\mathcal{T}^* \cup \mathcal{I} \cup \mathcal{B} \cup \mathcal{L})$, where $\mathcal{T}^*$
denotes the set of all RDF* triples [9,10]. The standard notions of compatibility,
merging and application of solution mappings can then be adapted as follows.


Definition 10. Two solution* mappings $\eta$, $\eta'$ are compatible if $\eta(v) = \eta'(v)$
for every variable $v \in \mathrm{dom}(\eta) \cap \mathrm{dom}(\eta')$.


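Definition 11. The merge of two compatible solution* mappings $\eta$ and $\eta'$,
denoted by $\eta \cup \eta'$, is a solution* mapping $\eta''$ with the following three properties:


- $\mathrm{dom}(\eta'') = \mathrm{dom}(\eta) \cup \mathrm{dom}(\eta')$,
- $\eta''(v) = \eta(v)$ for all $v \in \mathrm{dom}(\eta)$, and
- $\eta''(v) = \eta'(v)$ for all $v \in \mathrm{dom}(\eta') \setminus \mathrm{dom}(\eta)$.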


Definition 12. The application of a solution* mapping $\eta$ to an RSP-QL* pattern
$P$, denoted by $\eta[P]$, is the RSP-QL* pattern obtained by replacing all variables
in $P$ according to $\eta$.
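
For a single triple* pattern, applying a solution* mapping is a recursive substitution. The sketch below is an illustration restricted to triple* patterns (not arbitrary RSP-QL* patterns); it assumes that variables are strings starting with `?`, that patterns are nested 3-tuples, and that the names `is_var`, `variables`, and `apply_mapping` are hypothetical.

```python
def is_var(term):
    """A query variable is modelled as a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")

def variables(tp):
    """Collect the variables of a (possibly nested) triple* pattern."""
    found = set()
    for term in tp:
        if isinstance(term, tuple):
            found |= variables(term)
        elif is_var(term):
            found.add(term)
    return found

def apply_mapping(eta, tp):
    """Replace the variables of a triple* pattern according to the mapping eta."""
    return tuple(
        apply_mapping(eta, term) if isinstance(term, tuple)
        else eta.get(term, term) if is_var(term)
        else term
        for term in tp
    )

tp = (("?obs", "sosa:hasSimpleResult", "?hr"), "ex:confidence", "?c")
eta = {"?obs": "ex:o1", "?hr": 130, "?c": 0.95}
print(apply_mapping(eta, tp))
```

Variables that are not in the domain of the mapping are left in place, so a partial mapping yields a pattern that is only partially instantiated.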


We now define the corresponding algebra operators join, union, and left join. 
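Definition 13. Let $\Omega_1$ and $\Omega_2$ be sets of solution* mappings.


$\Omega_1 \bowtie \Omega_2 = \{\eta_1 \cup \eta_2 \mid \eta_1 \in \Omega_1, \eta_2 \in \Omega_2, \eta_1 \text{ and } \eta_2 \text{ are compatible}\}$
$\Omega_1 \cup \Omega_2 = \{\eta \mid \eta \in \Omega_1 \text{ or } \eta \in \Omega_2\}$
$\Omega_1 \;⟕\; \Omega_2 = (\Omega_1 \bowtie \Omega_2) \cup \{\eta \in \Omega_1 \mid \forall \eta' \in \Omega_2 : \eta \text{ and } \eta' \text{ are not compatible}\}$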
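
The compatibility test, the merge, and the three algebra operators can be sketched compactly in Python. This is an illustration under simplifying assumptions, not the prototype's code: solution* mappings are dictionaries, sets of mappings are lists without duplicates, and all function names are introduced here.

```python
def compatible(a, b):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())

def merge(a, b):
    """Merge two compatible mappings into one."""
    return {**a, **b}

def join(omega1, omega2):
    return [merge(a, b) for a in omega1 for b in omega2 if compatible(a, b)]

def union(omega1, omega2):
    result = list(omega1)
    result += [m for m in omega2 if m not in result]
    return result

def left_join(omega1, omega2):
    """Join, but keep mappings of omega1 that have no compatible partner."""
    unmatched = [a for a in omega1 if not any(compatible(a, b) for b in omega2)]
    return join(omega1, omega2) + unmatched

omega1 = [{"?p": "rut", "?hr": 130}, {"?p": "bob", "?hr": 80}]
omega2 = [{"?p": "rut", "?br": 9}]
print(join(omega1, omega2))
print(left_join(omega1, omega2))
```

The left join keeps the mapping for `bob` even though no breathing-rate mapping is compatible with it, which is the behaviour that OPT relies on.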


Based on these algebra operators, RSP-QL* patterns are evaluated over a 
background dataset and a set of named windows at a given timestamp. 


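Definition 14. Let $\mathbb{W}$ be a partial function that maps some IRIs in $\mathcal{I}$ to a
window over some RDF* stream, respectively, and $P$ be an RSP-QL* pattern
such that for every sub-pattern (WINDOW $n$ $P'$) in $P$ with $n \in \mathcal{I}$, it holds that
$\mathbb{W}$ is defined for $n$, i.e., $n \in \mathrm{dom}(\mathbb{W})$. Furthermore, let $D$ be an RDF* dataset,
$G$ be an RDF* graph, and $\tau$ be a timestamp. Then, the evaluation of $P$ over
$D$ and $\mathbb{W}$ at $\tau$ with $G$, denoted by $[\![P]\!]_G^{D,\mathbb{W},\tau}$, is defined recursively as follows:


1. If $P$ is a triple* pattern $tp$, then $[\![P]\!]_G^{D,\mathbb{W},\tau} = \{\eta \mid \mathrm{dom}(\eta) = \mathrm{var}(tp) \text{ and } \eta[tp] \in G\}$,
where $\mathrm{var}(tp)$ denotes the set of variables occurring in $tp$.

2. If $P$ is (GRAPH $u$ $P'$) for an IRI $u$, then $[\![P]\!]_G^{D,\mathbb{W},\tau} = [\![P']\!]_{G'}^{D,\mathbb{W},\tau}$ where $(u, G') \in D$.

3. If $P$ is (GRAPH $?x$ $P'$) for a variable $?x$, then $[\![P]\!]_G^{D,\mathbb{W},\tau} = \bigcup_{(u,G') \in D} [\![(\mathrm{GRAPH}\ u\ P')]\!]_G^{D,\mathbb{W},\tau}$.

4. If $P$ is (WINDOW $u$ $P'$) for an IRI $u$, then $[\![P]\!]_G^{D,\mathbb{W},\tau} = [\![P']\!]_{G''}^{DS(W),\mathbb{W},\tau}$ where $W = \mathbb{W}(u)$
and $G''$ is the default graph of the RDF* dataset $DS(W)$.

5. If $P$ is (WINDOW $?x$ $P'$) for a variable $?x$, then $[\![P]\!]_G^{D,\mathbb{W},\tau} = \bigcup_{u \in \mathrm{dom}(\mathbb{W})} [\![(\mathrm{WINDOW}\ u\ P')]\!]_G^{D,\mathbb{W},\tau}$.

6. If $P$ is ($P_1$ AND $P_2$), then $[\![P]\!]_G^{D,\mathbb{W},\tau} = [\![P_1]\!]_G^{D,\mathbb{W},\tau} \bowtie [\![P_2]\!]_G^{D,\mathbb{W},\tau}$.

7. If $P$ is ($P_1$ UNION $P_2$), then $[\![P]\!]_G^{D,\mathbb{W},\tau} = [\![P_1]\!]_G^{D,\mathbb{W},\tau} \cup [\![P_2]\!]_G^{D,\mathbb{W},\tau}$.

8. If $P$ is ($P_1$ OPT $P_2$), then $[\![P]\!]_G^{D,\mathbb{W},\tau} = [\![P_1]\!]_G^{D,\mathbb{W},\tau} \;⟕\; [\![P_2]\!]_G^{D,\mathbb{W},\tau}$.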
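
The base case of the evaluation, matching a single triple* pattern against an RDF* graph, can be illustrated with a small recursive matcher. The sketch below is an illustration only: it assumes variables are strings starting with `?`, triples and patterns are nested 3-tuples, and the graph is a list of triples; `match_term` and `evaluate` are hypothetical names.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_term(pattern, value, eta):
    """Extend the mapping eta so that `pattern` matches `value`; None on failure."""
    if isinstance(pattern, tuple):
        # A nested pattern only matches a nested triple, component by component.
        if not (isinstance(value, tuple) and len(value) == 3):
            return None
        for p, v in zip(pattern, value):
            eta = match_term(p, v, eta)
            if eta is None:
                return None
        return eta
    if is_var(pattern):
        if pattern in eta:
            return eta if eta[pattern] == value else None
        return {**eta, pattern: value}   # a variable may bind a whole nested triple
    return eta if pattern == value else None

def evaluate(tp, graph):
    """All mappings eta with dom(eta) = var(tp) such that eta applied to tp is in the graph."""
    results = []
    for triple in graph:
        eta = match_term(tp, triple, {})
        if eta is not None:
            results.append(eta)
    return results

graph = [
    (("ex:o1", "sosa:hasSimpleResult", 130), "ex:confidence", 0.95),
    (("ex:o2", "sosa:hasSimpleResult", 70), "ex:confidence", 0.6),
    ("ex:o1", "rdf:type", "sosa:Observation"),
]
tp = (("?obs", "sosa:hasSimpleResult", "?hr"), "ex:confidence", "?c")
print(evaluate(tp, graph))
```

Replacing the nested pattern with a single variable, as in `("?t", "ex:confidence", "?c")`, binds `?t` to the annotated triple itself, which reflects the extension of solution mappings to RDF* triples.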


It remains to define the semantics of RSP-QL* queries, which contain window 
declarations in addition to an RSP-QL* pattern (cf. Definition 9). 


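Definition 15. Let $\mathcal{S}$ be a finite set of named RDF* streams and $q = (\omega, P)$ be
an RSP-QL* query such that for every IRI $u_S \in \mathrm{dom}(\omega)$ there exists a named
RDF* stream $(u_S, S) \in \mathcal{S}$. Furthermore, let $D$ be an RDF* dataset and $\tau$ be a
timestamp. The evaluation of $q$ over $D$ and $\mathcal{S}$ at $\tau$, denoted by $[\![q]\!]^{D,\mathcal{S},\tau}$, is
defined as $[\![q]\!]^{D,\mathcal{S},\tau} = [\![P]\!]_G^{D,\mathbb{W},\tau}$ where $G$ is the default graph of $D$ and $\mathbb{W}$ is a
partial function such that $\mathrm{dom}(\mathbb{W}) = \mathrm{dom}(\omega)$ and for every IRI $u \in \mathrm{dom}(\mathbb{W})$,
it holds that $\mathbb{W}(u)$ is the time-based window $\mathcal{W}(S, x - \alpha, x)$ with $(u_S, S) \in \mathcal{S}$,
$(u_S, \alpha, \beta, \tau_0) = \omega(u)$ and $x = \tau_0 + \alpha + \beta \times i$ for the greatest possible $i \in \mathbb{N}$ for
which $x \leq \tau$.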
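
The window bounds used at a given evaluation time follow from simple arithmetic. The sketch below computes them for integer timestamps in seconds; it assumes that the condition on x is non-strict (x may equal the evaluation time) and that no window is available before the first full window width has elapsed. The function name `active_window` is hypothetical.

```python
def active_window(tau0, alpha, beta, tau):
    """Return the interval (x - alpha, x) of the window used at time tau, or None.

    x = tau0 + alpha + beta * i for the greatest natural number i with x <= tau.
    """
    if tau < tau0 + alpha:
        return None                     # the first window has not closed yet
    i = (tau - tau0 - alpha) // beta    # greatest i such that x <= tau
    x = tau0 + alpha + beta * i
    return (x - alpha, x)

# Width 60 s, slide 10 s, start time 0 (the parameters of RANGE PT1M STEP PT10S).
print(active_window(0, 60, 10, 95))
print(active_window(0, 60, 10, 100))
```

For an evaluation time of 95 s, the greatest admissible i is 3, giving x = 90 and the interval from 30 s (inclusive) to 90 s (exclusive); at 100 s the window has slid forward to the interval from 40 s to 100 s.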


7 Application-Based Evaluation 


In this section, we evaluate RSP-QL* based on the application use-case scenario 
introduced in Sect. 3. To this end, we make three assumptions. First, we assume
that all parameters about the patient are provided in separate streams. Second,
the thresholds for the physiological parameters are context dependent, and we
assume that the background data contains information about Rut’s expected values
with respect to some activity. Third, we assume that all physiological parameters
are reported with a confidence value representing some inherent uncertainty of
the sample.

Listing 1.2 illustrates a typical query for the application scenario. For the 
sake of readability, we have simplified the query slightly compared to the actual 
project application. Additional optimization strategies would also be employed 
in practice to provide improved scalability. 

The inputs to the query are five different streams that report data about the
patient’s current heart rate, breathing rate, oxygen saturation, location (of both 
Rut and Rut’s partner), and current activity, respectively. The activity stream 
might have been created by another reasoning mechanism in the system, which 
infers activities of daily life based on sensor inputs and the context. For each 
window, the values are filtered for specific values or a confidence threshold, 
and then the aggregated data is checked against the threshold values specific 
to the current context of the patient (e.g., including the type of activity). If 
these conditions are met, we consider it a low-emergency situation, as described 
in the scenario outlined in Sect.3. The resulting event is pushed to another 
stream upon which the system can act appropriately. In our use-case scenario, 
the system would first contact Rut’s partner. Similar queries could be set up to 
deal with other situations that the system should be able to detect. 


RSP-QL*: Enabling Statement-Level Annotations in RDF Streams 149 


BASE <http://base/>
PREFIX ex: <http://www.example.org/ontology#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX sosa: <http://www.w3.org/ns/sosa/>

REGISTER STREAM <alert/lowEmergencyAbnormalState> COMPUTED EVERY PT10S AS
SELECT ?activity (AVG(?hr) AS ?avgHr) (AVG(?br) AS ?avgBr) (AVG(?ox) AS ?avgOx)
FROM NAMED WINDOW <w/1> ON <s/activity> [RANGE PT10M STEP PT10S]
FROM NAMED WINDOW <w/2> ON <s/location> [RANGE PT10M STEP PT10S]
FROM NAMED WINDOW <w/3> ON <s/heart> [RANGE PT1M STEP PT10S]
FROM NAMED WINDOW <w/4> ON <s/breathing> [RANGE PT1M STEP PT10S]
FROM NAMED WINDOW <w/5> ON <s/oxygen> [RANGE PT1M STEP PT10S]
WHERE {
  ?person a foaf:Person ;
          foaf:name "Rut" ;
          ex:home ?home ;
          ex:partner ?partner .

  [] a ex:NormalSituation ;
     ex:forPerson ?person ;
     ex:forActivity ?activity ;
     ex:expectedHeartRate [ ex:upperBound ?hrMax ] ;
     ex:expectedBreathingRate [ ex:lowerBound ?brMin ] ;
     ex:expectedOxygenSaturation [ ex:lowerBound ?oxMin ; ex:upperBound ?oxMax ] .

  WINDOW <w/1> { # Current activity, reported by the system
    GRAPH ?g1 {
      [ a sosa:Observation ;
        sosa:featureOfInterest ?person ;
        sosa:hasSimpleResult ?activity ]
    }
  }
  WINDOW <w/2> { # Location of Rut’s partner
    GRAPH ?g2 {
      [ a sosa:Observation ;
        sosa:featureOfInterest ?partner ;
        sosa:hasSimpleResult ?loc ]
      FILTER(?loc != ?home)
    }
  }
  WINDOW <w/3> { # Heart rate
    GRAPH ?g3 {
      ?o3 a sosa:Observation ;
          sosa:featureOfInterest ?person .
      <<?o3 sosa:hasSimpleResult ?hr>> ex:confidence ?c3 .
      FILTER(?c3 > 0.95)
    }
  }
  WINDOW <w/4> { # Breathing rate
    GRAPH ?g4 {
      ?o4 a sosa:Observation ;
          sosa:featureOfInterest ?person .
      <<?o4 sosa:hasSimpleResult ?br>> ex:confidence ?c4 .
      FILTER(?c4 > 0.95)
    }
  }
  WINDOW <w/5> { # Oxygen saturation
    GRAPH ?g5 {
      ?o5 a sosa:Observation ;
          sosa:featureOfInterest ?person .
      <<?o5 sosa:hasSimpleResult ?ox>> ex:confidence ?c5 .
      FILTER(?c5 > 0.95)
    }
  }
}
GROUP BY ?activity ?hrMax ?brMin ?oxMin ?oxMax
HAVING(?avgHr > ?hrMax && ?avgBr < ?brMin && ?oxMin <= ?avgOx && ?avgOx <= ?oxMax)

Listing 1.2. The RSP-QL* query used in the use-case evaluation.

The application of RSP-QL* to this project use case shows that it is possible
to express the queries needed, and that the proposed language thereby fulfills
our use-case based requirements. In particular, it is worth noting the compactness
and relative readability of the query in Listing 1.2, as compared to the
corresponding RDF reification query³ (excluded due to space constraints).


8 Performance Evaluation 


In this section, we begin by briefly describing a prototype implementation of the 
proposed approach. We then report on the effects of the proposed RDF stream 
model with respect to data bandwidth, and compare it with a baseline approach 
of using RDF reification. Finally, we compare the query execution performance 
of the prototype when using RDF* as opposed to RDF reification, while varying 
the number of annotated triples per streamed element. 

All experiments were run on a MacBook Pro with 16 GB 1600 MHz DDR3 
memory, and a 2.8 GHz Intel Core i7. The experiments were run using Java 
1.8.0 with 2048 MB allocated for the JVM. All experiments were preceded by 
warm-up runs and averages for execution times were collected only after memory 
usage had stabilized. 


8.1 Prototype Implementation 


We implemented the prototype using Apache Jena⁴ and RDFstarTools⁵, where
the latter provides a collection of Java libraries for processing RDF* data and
SPARQL* queries. Additionally, we implemented a separate RSP-QL* query
parser and integrated it with the standard Jena architecture, along with an
extension of Jena’s query class to support the additional syntax elements defined
in RSP-QL*.

For the query execution, the implementation provides an extension of Jena’s 
query engine and query execution, supporting the new query operators. During 
query execution, all windows over streams are materialized as individual RDF* 
datasets. The execution’s active dataset then changes as needed when a window 
operation is evaluated. To improve evaluation efficiency, all parsed nodes are 
encoded as integers in one of two dictionaries: the node dictionary or the ref- 
erence dictionary. Regular RDF nodes are added to the node dictionary, while 
triple nodes are added to the reference dictionary, which (recursively) encodes 
each separate node of the triple. All nodes, regardless of type, are internally 
represented as an integer, where the most significant bit signals whether the 
ID represents a regular node or a reference triple. This allows the system to 
quickly check how a node should be decoded. Encoding and decoding iterators 


3 https://github.com/keski/RSPQLStarEngine/tree/master/publications/semantics2019.

4 https://jena.apache.org/ (version 3.8.0).

5 https://github.com/RDFstar/RDFstarTools (version from 2019-02-28).

are provided to support moving between ID-based iterators and Jena’s standard
iterator implementations.
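
The two-dictionary encoding can be sketched as follows. This Python illustration is not taken from the prototype, which is written in Java; the class name `Dictionaries`, the 64-bit ID width, and the tuple representation of triples are assumptions made for the example.

```python
REF_FLAG = 1 << 63  # assumed 64-bit IDs: the most significant bit marks a triple node

class Dictionaries:
    def __init__(self):
        self.nodes = []      # node dictionary: ID -> regular RDF node
        self.node_ids = {}   # regular RDF node -> ID
        self.refs = []       # reference dictionary: index -> (s_id, p_id, o_id)
        self.ref_ids = {}    # (s_id, p_id, o_id) -> ID with REF_FLAG set

    def encode(self, term):
        """Encode a regular node or a (possibly nested) triple as an integer ID."""
        if isinstance(term, tuple):
            key = tuple(self.encode(t) for t in term)   # recursively encode each node
            if key not in self.ref_ids:
                self.refs.append(key)
                self.ref_ids[key] = REF_FLAG | (len(self.refs) - 1)
            return self.ref_ids[key]
        if term not in self.node_ids:
            self.nodes.append(term)
            self.node_ids[term] = len(self.nodes) - 1
        return self.node_ids[term]

    def decode(self, ident):
        """Inspect the flag bit to decide which dictionary to consult."""
        if ident & REF_FLAG:
            return tuple(self.decode(i) for i in self.refs[ident & ~REF_FLAG])
        return self.nodes[ident]

d = Dictionaries()
annotated = (("ex:o1", "sosa:hasSimpleResult", "130"), "ex:confidence", "0.95")
ident = d.encode(annotated)
print(bool(ident & REF_FLAG), d.decode(ident) == annotated)
```

Because the flag lives in the ID itself, a single bit test distinguishes regular nodes from triple nodes without a dictionary lookup.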

The prototype is provided as open source⁶ under the MIT License. The
underlying data structures can easily be changed by providing alternative imple- 
mentations for the corresponding interfaces. 


8.2 Serialization Overhead 


One of the side effects of using RDF reification to annotate triples is that it
increases the size of the dataset, since for each reified triple four additional
triples have to be added. Thus, one of the benefits of the proposed extension
is the reduced overhead involved in transferring statement-level annotations in 
data streams. To compare the impact on bandwidth requirements, we compared 
the overhead in terms of bytes for each of the two approaches. The data was 
serialized using TriG*, which is an extension of Turtle* [9] for supporting named
graphs, and compressed⁷.

The amount of metadata per annotated triple impacts the relative overhead 
of the two approaches. For this evaluation, the TriG* serialization of each RDF* 
stream element contains declarations of one prefix, a base IRI, and a single 
metadata statement per annotated triple. Figure 1 shows the bandwidth required 
by the approaches as a function of the number of annotated triples per streamed 
element. The results show that the number of bytes required when using RDF*
is around half of that required when using RDF reification.
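To make the source of the overhead concrete, the following sketch serializes one annotated triple in both styles and compares the byte counts. The prefixes, the predicate `:confidence`, and the blank node label are illustrative assumptions, and the numbers it prints are not the measurements shown in Fig. 1.

```python
def rdf_star(s: str, p: str, o: str, meta_p: str, meta_o: str) -> str:
    """One annotated triple in Turtle*-style syntax."""
    return f"<<{s} {p} {o}>> {meta_p} {meta_o} ."


def reified(node: str, s: str, p: str, o: str, meta_p: str, meta_o: str) -> str:
    """The same annotation using the standard RDF reification vocabulary."""
    return (
        f"{node} a rdf:Statement ; rdf:subject {s} ; rdf:predicate {p} ; "
        f"rdf:object {o} ; {meta_p} {meta_o} ."
    )


star = rdf_star(":p1", ":hasPulse", "72", ":confidence", "0.9")
reif = reified("_:b0", ":p1", ":hasPulse", "72", ":confidence", "0.9")

star_bytes = len(star.encode("utf-8"))
reif_bytes = len(reif.encode("utf-8"))
print(star_bytes, reif_bytes)
```

The reified form carries the four extra statements (type, subject, predicate, object) before the metadata statement itself, so its size grows faster with the number of annotated triples.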


8.3 Query Execution Performance 


The performance of the approach was evaluated on the prototype implementation.
The streamed elements contained a single confidence-annotated triple,
where the number of additional triples annotated with some other metadata
predicate varied between experiment runs. A single evaluation query was used
to match and filter all triples annotated with the confidence value. We compared 
query execution times when representing the metadata using RDF* and querying 
it using RSP-QL* versus representing the metadata using RDF reification and 
querying it using pure reification-based RSP-QL queries. The prototype applies 
no specific optimization techniques for the queries; thus, the two approaches 
differ only with respect to how statement-level metadata is represented inter- 
nally. The RDF reification approach simply uses regular triple-pattern match- 
ing, whereas the RDF* approach represents the annotated triples as resources 
on the physical level. For the RDF reification query, we provided an additional 
version of the query optimized based on the heuristics described by Tsialiamanis 
et al. [19], where the order of the matched triple patterns was determined based 
on selectivity. Figure 2 presents the average query execution times. The results 


6 https://github.com/keski/RSPQLStarEngine. 
7 Compression here included the removal of excessive whitespace characters, the use
of prefixes, and the use of predicate lists where appropriate.
Fig. 1. Byte size of a serialized RDF* stream element as a function of the
number of triples in it, where each triple is annotated with exactly one
metadata triple.

Fig. 2. Average query execution times of the prototype implementation when
using either RDF*, reification, or reification with query optimization based
on selectivity.


show that the advantage of the proposed approach grows with the number of 
distinct triples annotated in each streamed element, but that this difference can 
potentially be reduced by applying established optimization heuristics. 
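The idea of ordering triple patterns by selectivity can be sketched as follows. This is a generic cardinality-based ordering, not the specific heuristics of Tsialiamanis et al. [19]; the function name and the per-predicate counts are made up for illustration.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, str, str]  # terms starting with "?" are variables


def order_by_selectivity(patterns: List[Pattern],
                         predicate_counts: Dict[str, int]) -> List[Pattern]:
    """Return the patterns sorted so that the most selective ones come first."""
    worst = max(predicate_counts.values(), default=0) + 1

    def estimated_matches(pattern: Pattern) -> int:
        # Predicates without statistics (or variables) are assumed least selective.
        return predicate_counts.get(pattern[1], worst)

    return sorted(patterns, key=estimated_matches)


# Reification-style patterns with made-up occurrence counts per predicate.
reification_patterns = [
    ("?r", "rdf:type", "rdf:Statement"),
    ("?r", "rdf:subject", "?s"),
    ("?r", "rdf:predicate", "?p"),
    ("?r", "rdf:object", "?o"),
    ("?r", ":confidence", "?c"),
]
counts = {"rdf:type": 50, "rdf:subject": 10, "rdf:predicate": 10,
          "rdf:object": 10, ":confidence": 1}
ordered = order_by_selectivity(reification_patterns, counts)
```

Evaluating the rare `:confidence` pattern first keeps the intermediate result small before the remaining reification patterns are joined.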


9 Discussion 


The proposed approach provides a compact and intuitive way for both represent- 
ing and querying annotated triples. Other approaches that could be considered 
for this purpose include single-triple named graphs [12], singleton properties [15], 
and RDF reification [11], but these approaches come with various drawbacks. 

The application of named graphs inhibits the use of the graph name for other 
purposes, which means it is not compatible with the structure of RDF stream 
elements. Singleton properties introduce large numbers of unique predicates, 
which can adversely affect query execution performance. RDF reification, on 
the other hand, is both part of the RDF standard and can be supported in 
RSP-QL. However, RDF reification is verbose, both with respect to representing 
and querying data. 

We note that RDF* and SPARQL* may be understood simply as syntactic
sugar on top of RDF and SPARQL [9], and by extension this applies to
the approach presented in this paper. However, the evaluation of the prototype 
implementation illustrates that representing annotated triples as resources on 
the physical level can have positive effects on the query execution level. When 
matching a single RDF reification triple, a total of four additional triple pat- 
terns have to be evaluated. In fact, due to this inefficiency, many RDF stores 
implement specific strategies for representing annotated triples [8]. For example, 
Virtuoso⁸ encodes RDF reification statements as quads, Apache Jena⁹ provides
an implementation of a node type with direct access to the statement it reifies,
and Blazegraph¹⁰ uses an approach similar to the one implemented in our
prototype.

RDF* and SPARQL*, and thus RSP-QL*, simplify the representation of complex
scenarios, both from the perspective of modeling and of querying annotated
metadata. For example, we may want to treat an RDF statement differently
depending on whether the uncertainty associated with it has been automatically
generated by a sensor or originates from a physician. Querying this using
RSP-QL* simply involves having a triple* pattern with two layers of nesting.
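A two-layer nested pattern of this kind can be mimicked with nested tuples. The sketch below is a toy matcher, not RSP-QL* itself; the predicates `:confidence` and `:assertedBy` and the resource names are hypothetical.

```python
from typing import Dict, Optional, Tuple, Union

Term = Union[str, Tuple["Term", "Term", "Term"]]
Bindings = Dict[str, Term]


def match(pattern: Term, data: Term, bindings: Bindings) -> Optional[Bindings]:
    """Match a (possibly nested) pattern against a (possibly nested) term."""
    if isinstance(pattern, str):
        if pattern.startswith("?"):
            if pattern in bindings:
                return bindings if bindings[pattern] == data else None
            return {**bindings, pattern: data}
        return bindings if pattern == data else None
    if not isinstance(data, tuple):
        return None
    for sub_pattern, sub_data in zip(pattern, data):
        result = match(sub_pattern, sub_data, bindings)
        if result is None:
            return None
        bindings = result
    return bindings


# A statement, its confidence, and who asserted that confidence (two nesting levels).
fact = (((":p1", ":hasPulse", "72"), ":confidence", "0.9"), ":assertedBy", ":sensor7")
pattern = ((("?patient", ":hasPulse", "?v"), ":confidence", "?c"), ":assertedBy", "?src")
result = match(pattern, fact, {})
```

Replacing `?src` by a constant such as `:physician` restricts the match to confidence values that originate from a physician.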

As part of future work, we plan to relax some of the assumptions made in
the semantics and to add support for additional features defined in RSP-QL, such
as count-based window operators and output stream operators.


10 Conclusion 


In this paper, we have presented a novel way of annotating and querying 
statement-level metadata in RDF Stream Processing (RSP), and formally 
defined the new continuous query language RSP-QL*. The approach extends
RDF streams to allow triples to directly use other triples in the subject and 
object positions, and similarly extends the current version of RSP-QL to query 
these, by leveraging and building on the concepts previously proposed for RDF* 
and SPARQL* [9, 10]. 

The proposed approach was applied in a use case from an e-health research 
project, where multiple data streams have to be queried in parallel, and over 
extended periods of time, to detect possibly abnormal situations. The results 
show that RSP-QL* meets all our use-case requirements, and provides a com- 
pact and intuitive way of expressing and querying statement-level metadata, 
compared with the baseline approach of using RDF reification. Furthermore, 
the prototype implementation presented in the paper, which is provided as open- 
source, demonstrates benefits over the baseline approach, both with respect to 
the bandwidth required for data transfer and with respect to query execution 
performance over statement-level annotations. RDF* is a syntactically more
compact way to express metadata annotations, and our experiments show that this
difference is large enough to have an impact in deployed real-world systems and 
applications, where bandwidth may be limited. Although our prototype imple- 
mentation is not optimized for query performance, we were able to demonstrate 
that the approach was faster with respect to query execution performance, when 
compared to using standard RDF reification. 

This is the first work on RSP that has focused on supporting annotations 
on the statement level. We believe that the proposed approach provides an 


8 https://virtuoso.openlinksw.com/.
9 https://jena.apache.org/.
10 https://wiki.blazegraph.com/.
intuitive and compact way for representing and querying statement-level
metadata, and that this work provides a good foundation for future research on
efficient management of, e.g., uncertainty and provenance, in RDF data streams.
Abstract. The disruptive potential of the upcoming digital transformations
for the industrial manufacturing domain has led to several reference
frameworks and numerous standardization approaches. On the other hand, the
Semantic Web community has made significant contributions in the field, for
instance on data and service description, integration of heterogeneous
sources and devices, and AI techniques in distributed systems. These two
streams of work are, however, mostly unrelated and only briefly consider
each other's requirements, practices and terminology. We address this gap by
providing the Semantic Asset Administration Shell, an RDF-based
representation of the Industrie 4.0 Component. We provide an ontology for
the latest data model specification, create an RML mapping, supply resources
to validate the RDF entities, and introduce basic reasoning on the Asset
Administration Shell data model. Furthermore, we discuss the different
assumptions and presentation patterns, and analyze the implications of a
semantic representation on the original data. We evaluate the resulting
overheads and conclude that the semantic lifting is manageable, also for
restricted or embedded devices, and therefore meets the conditions of
Industrie 4.0 scenarios.


Keywords: Industrie 4.0 · Data lifting · Asset Administration Shell


1 Introduction 


Even though the various digital developments and internet-based technologies 
have attracted great attention in the manufacturing industry, a common under- 
standing of the resulting requirements and implications has not been reached. 
The number of different terms, which are being used in this context, reflects this 
challenge — Internet of Things (IoT), Industrial Internet, Cyber-physical Sys- 
tems, Digital Twins and many more have slightly overlapping scopes but still 
depict different applications and features. Still, the primary target is always the 
effective integration and interoperability of industrial devices, services and data 
sources. Therefore, the actual implementations require clear specifications of the 
used data formats, interfaces, and semantic meaning of the referenced objects 
and attributes. 
IoT data is currently mainly exchanged in either JSON or XML. These commonly
used data formats ease the serialization and parsing by providing
specifications for the syntactic structure of the data objects. Additional information on
the meaning of keys/values is usually specified in customized data models and 
schemata. The latest specification of the Plattform Industrie 4.0 Asset Admin- 
istration Shell (AAS) also follows this convention [1]. The AAS is promoted as 
the digital twin for the German Plattform Industrie 4.0 and encompasses the 
interpretation of the digital representation of any production-related asset. As 
such, materials and products, devices and machines but also software and digital 
services have a respective digital version. 

While the predefined structure and the usage of specific keys reduce the 
heterogeneity inherent in the data exchange processes of current industrial sce- 
narios, all real-world scenarios still require a thorough understanding of the 
specific terms and values. They therefore depend on extensive manual
work and understanding of the extended AAS model, followed by a time-consuming
data mapping. A semantic formalization of entities and data objects has
several advantages in this context. The mature Semantic Web technology stack
around RDF enables clear references to classes, properties and instances in the
form of URIs, beyond the scope of single AAS objects but also across
applications, domains, and organizations. The defined meaning of the used entities
further allows their combination with predefined logical axioms, which allow the
automatic derivation of new knowledge.

We contribute to the state of the art by presenting a mapping from the latest 
AAS data model to RDF. Thus we provide a data model as an openly accessible 
ontology and create SHACL shapes for all classes to enable schema validation. 
We outline the various pitfalls, especially the different patterns used to identify
and refer to encoded entities and to link to remote resources. Based on the
inherent Web nature of RDF, we show how the transformation to the semantic data
model decreases the amount of required storage space. Furthermore, we present
patterns to directly insert the RDF translation into the original XML and JSON 
files and discuss their implications. Relying on the RDF/XML and JSON-LD 
serializations, we are able to merge the predefined data structure with the seman- 
tically defined data. We show that the provided extension points in the form of 
submodel elements are suitable for this task and that the output AAS files are
still processable by existing software; the risk of compatibility issues is
therefore manageable.

The applicability of the presented approach is evaluated by determining the 
necessary overhead in terms of both storage and computation effort, and by a 
detailed discussion of the restrictions of the RDF version. We show that some 
semantic constructs are more efficient than the originally specified ones, whereas 
others are not directly compatible with the data structure of RDF and some are 
even not expressible at all. 

In this context, the paper makes the following contributions: (1) an RDF
data model of the Semantic Asset Administration Shell (SAAS), (2) a mapping
from XML Asset Administration Shell representations to SAAS, (3) a set of
preliminary reasoning axioms in order to explicitly derive implicitly encoded
information from the data model, and (4) a validation model for this data model,
encoded through SHACL shapes.

The remainder of this paper is organized as follows. Section 2 contains an
overview of similar efforts in the field. Section 3 introduces a formalization of
the regarded domain, followed by the presentation of the RAMI ontology (Sect. 4)
and an RML mapping (Sect. 5). Section 6 briefly examines several axioms for automated
reasoning on top of the SAAS, while Sect. 7 illustrates the provided SHACL
shapes for schema validation. We use several use cases (Sect. 8) to evaluate our
approach (Sect. 9). Finally, we conclude with a discussion on the potential of the
SAAS and outline further research gaps.


2 Related Work 


In this section, we discuss three areas of related work — the data model of the 
Asset Administration Shell, the existing mappings towards a semantic represen- 
tation and related mappings of Industrie 4.0 data models to RDF. 

Barnstedt et al. define the data model of the Asset Administration Shell [1], 
the form of identifiers, access rights and roles, as well as XML and JSON serial- 
izations and their transport. The textual documentation of the model is enhanced 
with XML and JSON schemata. The model defines a basic set of keys and prop- 
erties, and outlines defined points for custom vocabularies and terminologies. 
Part 2 of the specification will further determine the APIs and interaction functions
of the Asset Administration Shell, and how operations can be provided and
described for Industrie 4.0 (Fig. 1).


[Figure: UML excerpt of the AAS data model; legible labels: HasDataSpecification, Identifiable, AssetAdministrationShell, derivedFrom: AssetAdministrationShell* [0..1], containedElement: Referable* [0..*].]

Fig. 1. Sections of the Asset Administration Shell Data Model according to [1] (page 44).


Grangel-González et al. provide a first RDF data model for the Asset
Administration Shell and the respective technical standards as published by ISO, IEC,
and DIN [6]. They further extended the work in [5] with a formalized model of
the Reference Architecture for Industrie 4.0 (RAMI4.0) and entities for units of
measurements and provenance, and show a prototypical mapping using R2RML.
However, the mapping itself was not generally applicable to other Asset Shells,
as a common data model had not been specified at that time.

Tantik and Anderl [11] present an analysis of how recommendations of the
World Wide Web Consortium (W3C) fit the guidelines of the Plattform
Industrie 4.0. They outline various suggestions for how standardized Web technologies can
be integrated into Asset Shells. The authors present best practices and
integration methods through a sample implementation scenario but do not discuss the
implications for the data model itself.

Mappings of relational or otherwise formatted data to RDF are possible 
with the RDB to RDF Mapping Language R2RML [2] or the more broadly applicable
RDF Mapping Language RML [4], which also enables mappings from JSON, 
XML or CSV to RDF. The desired transformations are also formulated in RDF 
by defining the output graph structure by so-called Maps and URI templates. 
While R2RML strictly relies on tables, and uses column names as resource and 
attribute identifiers of row-based data objects, RML also transforms JSON and 
XML data by identifying objects according to their keys. Even though some 
tools have been introduced in order to support the creation of mappings for 
both approaches, the possibility to collaboratively work on mappings was not 
part of the design requirements and is still missing. 

Katti et al. [8] show by integrating the machine-to-machine communication
protocol OPC-UA for servers and clients how semantic descriptions, in particular 
SAWSDL annotations, bridge the gap between the heterogeneous devices of the 
shop floor. The use of uniquely identified semantic descriptions supports the 
automatic orchestration of decoupled Cyber-physical Systems. However, only the 
specific input and output requirements of the OPC-UA methods are described. 
Neither the data objects nor the OPC-UA general information model is reflected. 

Dietrich et al. examine the semantic characteristics of the Asset Adminis- 
tration Shell in [3]. They outline the identification of attributes and properties 
through cross-industry standards, mainly IEC 61360 and eCl@ss. In addition,
they discuss mappings to AutomationML and OPC-UA. However, Dietrich et 
al. do not recognize the concepts of the Semantic Web and therefore do not show 
how to integrate the Administration Shell with its technology stack. 

Currently, to the best of our knowledge, there is no RDF representation 
of the officially released data model of the Asset Administration Shell. This 
is necessary in order to build a bridge between the latest approaches of data 
provisioning models in the manufacturing domain and the rich and mature data 
integration and formalization capabilities of the Semantic Web. As such, an RDF 
data model has the potential to ease the information exchange but also provides 
the capabilities to introduce logical reasoning to the Asset Administration Shell. 


3 Methodology 


The data model for the Industrie 4.0 component aims to provide high coverage of 
the different modeling variants. RDF, on the other hand, imposes specific conditions
on how data is presented (triple-based structure, URIs as identifiers). In order to
structure the contribution of this paper, the parts of the respective data models 
are defined as follows: 

AAS captures the information about the Asset Administration Shell itself.
In this regard, AAS is the digital representation or Digital Twin of the Asset.
Information from AAS, therefore, refers to the information object or document
and only indirectly to the original asset. Examples are the creation date of the
digital representation, manuals, or how the AAS was generated or modified. It
is important to note that the same reference is used to denote both the Asset
Administration Shell itself and the set of information contained by it.

A captures the information about the actual asset. The asset can be anything 
of interest in the context of a digital production setting. Even though assets are 
usually embedded devices or internet-capable components, any physical object, 
such as materials, production goods or machines, can be seen as an asset too. 
In addition, assets also include software components and any digital service or 
intangible thing, which is necessary to model a manufacturing use case. 

S denotes the submodel of the asset shell. Submodels partition the provided 
information and categorize facts according to their usage, for instance as part of 
a documentation submodel or a submodel for quality testing. Submodels are fur- 
ther separated into SubmodelElements, which are either themselves collections of 
SubmodelElements or the final bearer of key-value-encoded facts. As any combi- 
nation of different submodels can be included in the Asset Administration Shell, 
the set S* represents the superset, including all possible submodels. 

I is the set of identifiers for data objects. Specifically, $I = I_{glob} \cup I_{loc}$, where
$I_{glob}$ contains all globally valid identifiers, while the elements of $I_{loc}$ are only
valid in their context, in particular inside the AAS that uses them.

The concept descriptions denoted with CD may provide further defini- 
tions about the used concepts, mainly attributes and data types. While con- 
cept descriptions are optional components of an AAS, they give the ability to 
place necessary explanations especially for entities with local identifiers close to 
the data. Similarly to submodels, concept descriptions are not limited in
number; therefore, the superset CD* is used.

An instance aas of an AAS is, therefore, defined by the union of the
mentioned sets:

$aas \in AAS \cup A \cup S^{*} \cup CD^{*}$ (1)


The identifiers appear in all sets and are therefore not mentioned separately. 
They connect the objects of the different sets with each other. However, the 
nature of identifiers in the AAS data model is mostly the one of foreign keys, 
which do not link directly to the intended object. We define two types of functions 
on the administration shell. First, a serialization ser transforms each adminis- 
tration shell to a representation in a data format, in particular JSON and XML: 
$ser: AAS \rightarrow D = \{XML, JSON, \ldots\}$

Second, a mapping is a transformation m from the data model AAS to the 
Semantic Asset Administration Shell SAAS. SAAS is defined as 


$SAAS = AAS_{RDF} \cup A_{RDF} \cup S^{*}_{RDF} \cup CD^{*}_{RDF}$ (2)
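The sets defined in this section can be pictured as a small object model. The following sketch is only an illustration of the formalization; the class and field names are hypothetical and do not come from the AAS specification or the SAAS ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Identifier:
    value: str
    is_global: bool  # True: element of I_glob, False: element of I_loc


@dataclass
class Asset:
    identifier: Identifier


@dataclass
class Submodel:
    identifier: Identifier
    elements: List[str] = field(default_factory=list)


@dataclass
class ConceptDescription:
    identifier: Identifier
    definition: str


@dataclass
class AssetAdministrationShell:
    identifier: Identifier
    asset: Asset
    submodels: List[Submodel] = field(default_factory=list)              # any number (S*)
    concept_descriptions: List[ConceptDescription] = field(default_factory=list)  # optional (CD*)

    def identifiers(self) -> List[Identifier]:
        """All identifiers of the shell; they occur in every set of Eq. (1)."""
        ids = [self.identifier, self.asset.identifier]
        ids += [s.identifier for s in self.submodels]
        ids += [c.identifier for c in self.concept_descriptions]
        return ids


shell = AssetAdministrationShell(
    identifier=Identifier("http://example.org/aas/pi", True),
    asset=Asset(Identifier("http://example.org/devices/pi-1", True)),
    submodels=[Submodel(Identifier("documentation", False))],
)
local_ids = [i.value for i in shell.identifiers() if not i.is_global]
```

Listing the local identifiers, as in the last line, shows which parts of an instance would need extra care when every identifier has to become a URI.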
Using these definitions, an AAS in XML undergoes several steps (see Fig. 2).
A SAAS object created using the provided mapping (Sect. 5) can be sent to
a reasoning engine (Sect. 6) to enrich it with additional facts. Both the native
$SAAS_{RDF}$ and its enriched version can be forwarded to a validation module
(Sect. 7). The validation module creates a validation report containing the errors
and inconsistencies with respect to the SAAS schema. SAAS objects created by other
means can, of course, also be sent to the reasoning or validation modules (bottom lane).


[Figure: process diagram with the steps mapping, serialization, reasoning, and schema validation; custom SAAS_RDF objects enter via the bottom lane.]

Fig. 2. Process steps through the provided modules.


4 The SAAS Data Model 


In the following, we present the SAAS data model as an RDF ontology¹. As
mentioned, the ontology is an advanced version of the RAMI ontology [6] and,
therefore, the namespace rami is used. For each class from [1], a corresponding
OWL Class has been created, and every attribute has been mirrored with either
an ObjectProperty or a DataProperty, except for ‘semanticId’. The reason
for the latter is that ‘semanticId’ links to the unique identifier of the entity. In
RDF, this is the entity URI itself and therefore does not need to be repeated
(Fig. 3).

All RDF entities are supplied with (sub)class assertions, labels and com- 
ments. The SAAS classes reflect the original ones in most cases and form a sub- 
class hierarchy based on the inheritance specification of the AAS data model. 
However, neither RDF nor OWL has a notion of abstract classes. AAS uses abstract class
constructs to partition certain attribute requirements and characteristics. For
instance, the ‘Has Kind’ class covers all realizations that contain a ‘kind’
attribute. This attribute encodes whether a certain entity refers to a
concrete instance (the explicit machine installed on a shop floor) or is related to a
whole type (machine type A can be installed in a certain setting). The data model
reflects the abstract nature through :class skos:note “abstract” statements.
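Since the abstract flag is only a plain note, a consumer has to interpret it itself. The sketch below shows one way to do so over triples held as Python tuples; the IRIs `rami:HasKind` and `rami:Asset` and the helper names are assumptions for illustration and may differ from the actual ontology.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

ontology: Set[Triple] = {
    ("rami:HasKind", "rdf:type", "owl:Class"),
    ("rami:HasKind", "skos:note", "abstract"),
    ("rami:Asset", "rdf:type", "owl:Class"),
    ("rami:Asset", "rdfs:subClassOf", "rami:HasKind"),
}


def abstract_classes(triples: Set[Triple]) -> Set[str]:
    """Classes flagged with the plain-text note "abstract"."""
    return {s for s, p, o in triples if p == "skos:note" and o == "abstract"}


def direct_instances_of_abstract(data: Set[Triple], triples: Set[Triple]) -> List[str]:
    """Instances typed directly with an abstract class."""
    flagged = abstract_classes(triples)
    return sorted(s for s, p, o in data if p == "rdf:type" and o in flagged)


data: Set[Triple] = {
    ("ex:a1", "rdf:type", "rami:Asset"),
    ("ex:k1", "rdf:type", "rami:HasKind"),
}
offenders = direct_instances_of_abstract(data, ontology)
```

Neither RDF nor OWL would reject `ex:k1`, so a check of this kind has to live in application code or in a validation layer.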

While the existing schemata for XML and JSON are based on a tree structure,
the RDF data model supports a more generic graph structure. This might
lead to the conclusion that for every model from $AAS_{XML}$ or $AAS_{JSON}$ a
corresponding RDF serialization must be possible, i.e., $AAS \subset SAAS$. However, we will
show that some limitations exist and that actually $AAS \supset SAAS$ is the case.


1 https://github.com/i40-Tools/RAMIOntology.
[Figure: graph visualization of the SAAS ontology; legible labels include Admin Shell and has Reference.]

Fig. 3. Overview of the most important classes and properties of the SAAS
(for the full visualization see http://www.visualdataweb.de/webvowl/#iri=https://raw.githubusercontent.com/i40-Tools/RAMIOntology/master/rami.ttl).


5 Mapping to RDF 


The Administration Shell object (AAS) is the root of every Asset Administration
Shell. Listing 1.1 shows an example XML snippet. As the root entity, it is
also the entry point for traversing the SAAS graph. A native mapping is always
possible if the identifier is already given in the form of a URI. However,
International Registration Data Identifiers (IRDI) and any other custom format
are also allowed. While IRDIs in the case of the widespread eCl@ss system can be
mapped to URIs with significant additional effort, this is in general a very
hard and error-prone challenge². This becomes even harder for proprietary
or custom identifiers. In addition, custom identifiers may contain special
characters such as spaces or hash signs. These characters are percent-encoded
(# → %23), which changes the appearance of the identifiers. As a result, only native URI
identifiers can be mapped without risk, not only for AAS identifiers but also for
the other sets discussed in the following.
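The identifier cases discussed above can be sketched with Python's standard library. The eCl@ss URL template is taken from the example in the footnote; the function names and the restriction to http(s) schemes are simplifying assumptions of this sketch.

```python
from urllib.parse import quote, urlparse

# Template derived from the footnote example; not an official eCl@ss resolver.
ECLASS_TEMPLATE = "https://www.eclasscontent.com/index.php?id={digits}"


def is_native_uri(identifier: str) -> bool:
    """True if the identifier already is an absolute http(s) URI."""
    parsed = urlparse(identifier)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def eclass_to_uri(eclass_id: str) -> str:
    """Map an eCl@ss ID such as 26-04-07-02 onto the template above."""
    return ECLASS_TEMPLATE.format(digits=eclass_id.replace("-", ""))


def percent_encode(identifier: str) -> str:
    """Percent-encode every reserved character of a custom identifier."""
    return quote(identifier, safe="")


encoded = percent_encode("pump #7")
eclass_uri = eclass_to_uri("26-04-07-02")
```

The encoded form `pump%20%237` no longer resembles the original identifier, which is why the text regards only native URI identifiers as safe to map.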

A consequence of this decision is also that the ‘Has Semantics’ class and
the ‘semanticId’ property of the AAS data model become native to all objects.
Moreover, it implies that every URI not only uniquely identifies its data
object but also supplies the semantic definition of its meaning. This rather strict


2 For instance, templates for eCl@ss IDs, e.g. 26-04-07-02 (High-voltage current), may
map to https://www.eclasscontent.com/index.php?id=26040702.
<?xml version="1.0"?>
<!-- Excerpt: further namespace declarations (e.g. for the aas prefix) are omitted. -->
<aas:aasenv xmlns:IEC61360="http://www.admin-shell.io/...">
  <aas:assetAdministrationShells>
    <aas:assetAdministrationShell>
      <aas:idShort>RaspberryPiModel3B+</aas:idShort>
      <aas:identification idType="URI">
        http://iais.fraunhofer.de/.../raspberry_pi_3b_plus
      </aas:identification>
      <aas:assetRef>
        <aas:keys>
          <aas:key type="Asset" local="true" idType="URI">
            https://iais.fraunhofer.de/.../rspbry/755003377
          </aas:key>
        </aas:keys>
      </aas:assetRef>
    </aas:assetAdministrationShell>
  </aas:assetAdministrationShells>
</aas:aasenv>

Listing 1.1. XML serialization of the Raspberry Pi AAS.


requirement can be further aligned with the Linked Data Principles if URIs are
also required to point to actual resources. However, dereferenceable URIs are not
a requirement for now but should be seen as a preferable best practice.

The asset objects (A) constitute the link from the AAS to the real-world
thing. As assets themselves only contain a very brief description, only the class
assertions (rdf:type), the name (rdfs:label), descriptions (rdfs:comment) and the
kind attribute are translated to $A_{RDF}$.

Submodels (S) and SubmodelElements are the core information carriers of
the Asset Administration Shell. The basic structure of the submodel serves as a
bracket for several SubmodelElements. Abstract SubmodelElements can be
realized by Operations, ReferenceElements, Files, binary objects (Blob) and
Properties. Properties have further attributes such as a key, value, value type and
several others. In order to align the Property class with the graph model of
RDF, each instance is transformed to a respective rdf:Property. Therefore, a
distinct class ‘Property’ does not exist in SAAS. The alternative of using
n-ary relations, which would further allow linking more attributes to the
relation, was discarded in order to keep the graphs cleaner. Consequently, not all
Property objects can be translated to the SAAS model.³
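A rough sketch of this design choice follows. It assumes that the URI given as `semanticId` serves as the predicate and that a Property arrives as a dictionary with the keys `semanticId`, `value` and `valueType`; these key names and the helper are illustrative assumptions, not the published mapping.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def property_to_triples(owner_uri: str, prop: Dict[str, str]) -> Optional[List[Triple]]:
    """Turn one AAS Property into a predicate declaration plus one statement.

    Returns None when no URI is available, i.e. when the Property cannot be
    expressed as an RDF predicate.
    """
    semantic_id = prop.get("semanticId", "")
    if not semantic_id.startswith(("http://", "https://")):
        return None
    literal = f'"{prop["value"]}"^^xsd:{prop.get("valueType", "string")}'
    return [
        (f"<{semantic_id}>", "rdf:type", "rdf:Property"),
        (f"<{owner_uri}>", f"<{semantic_id}>", literal),
    ]


translated = property_to_triples(
    "http://example.org/submodels/tech",
    {"semanticId": "http://example.org/vocab/maxVoltage", "value": "5", "valueType": "integer"},
)
skipped = property_to_triples(
    "http://example.org/submodels/tech",
    {"semanticId": "local-id-17", "value": "5"},
)
```

The second call returns `None`, which mirrors the statement that not every Property object has a counterpart in the SAAS model.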

Mainly, attributes and properties are converted to triples and identifiers are
restricted to URIs. Therefore, all identifiers of attributes become globally valid,
as URIs are globally valid. We deliberately decided against n-ary
constructs with blank nodes and an explicit property class, which would have been
closer to the data model influenced by XML and JSON. The reason is that an


3 Examples can be found at https://github.com/i40-Tools/RAMIOntology/tree/master/AssetAdministrationShell_examples.
# Maps each Asset Administration Shell to an instance of rami:AdminShell.
_:AssetShellMap a rr:TriplesMap ;
    # The subject IRI is taken from the "identification" element.
    rr:subjectMap [
        rml:reference "identification" ;
        rr:class rami:AdminShell ] ;
    # The "idShort" value becomes the rdfs:label literal.
    rr:predicateObjectMap [
        rr:predicateMap [ rr:constant rdfs:label ] ;
        rr:objectMap [
            rml:reference "idShort" ;
            rr:termType rr:Literal ;
            rr:datatype xsd:string ]
    ] .

Listing 1.2. Example RML TriplesMap excerpt.


<http://iais.fraunhofer.de/en/aas/examples/raspberry_pi_3b_plus> a rami:AssetShell ;
    rdfs:label "RaspberryPiModel3B+" ;
    rami:hasAsset "http://iais.fraunhofer.de/en/aas/devices/rspbry/755003377" ; ...

Listing 1.3. Equivalent representation to Listing 1.1 as RDF/Turtle.


thereby created graph increases in complexity while its comprehensibility
significantly decreases and the information content stays the same.⁴
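What the mapping in Listing 1.2 does can be imitated in a few lines of standard-library Python. This is a hand-written stand-in, not RML or RMLMapper: the namespace URI `http://example.org/aas#` and the example.org identifiers are placeholders (the real ones are elided in Listing 1.1), the class name follows Listing 1.2, and the string-valued `rami:hasAsset` follows Listing 1.3.

```python
import xml.etree.ElementTree as ET

AAS_NS = "http://example.org/aas#"  # placeholder; the real namespace is elided in Listing 1.1
NS = {"aas": AAS_NS}

XML = f"""<aas:aasenv xmlns:aas="{AAS_NS}">
  <aas:assetAdministrationShells>
    <aas:assetAdministrationShell>
      <aas:idShort>RaspberryPiModel3B+</aas:idShort>
      <aas:identification idType="URI">
        http://example.org/aas/raspberry_pi_3b_plus
      </aas:identification>
      <aas:assetRef><aas:keys>
        <aas:key type="Asset" local="true" idType="URI">
          http://example.org/devices/rspbry/755003377
        </aas:key>
      </aas:keys></aas:assetRef>
    </aas:assetAdministrationShell>
  </aas:assetAdministrationShells>
</aas:aasenv>"""


def shells_to_turtle(xml_text: str) -> str:
    """Emit one Turtle block per Asset Administration Shell found in the XML."""
    root = ET.fromstring(xml_text)
    blocks = []
    for shell in root.iterfind(".//aas:assetAdministrationShell", NS):
        subject = shell.findtext("aas:identification", namespaces=NS).strip()
        label = shell.findtext("aas:idShort", namespaces=NS).strip()
        asset = shell.findtext("aas:assetRef/aas:keys/aas:key", namespaces=NS).strip()
        blocks.append(
            f"<{subject}> a rami:AdminShell ;\n"
            f'    rdfs:label "{label}" ;\n'
            f'    rami:hasAsset "{asset}" .'
        )
    return "\n\n".join(blocks)


turtle = shells_to_turtle(XML)
print(turtle)
```

Only shells whose identification already is a URI can be handled this directly; other identifier types would need the additional treatment discussed earlier in this section.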

Concept description objects (CD) serve as local dictionaries for the entities used.
As shipping definitions and metadata directly with the productive data
eases its interpretation, Concept Descriptions increase the degree of
interoperability between components that provide and consume AAS. RDF and Linked
Data, however, promote dereferencing URIs in order to retrieve
metadata. In that sense, Linked Data conventions can reduce the amount of
transmitted data. On the other hand, not all relevant Industrie 4.0 components
are able to actively request such metadata. The ability to independently
open outgoing connections beyond the restricted shop floor network is usually
also a security risk and not good practice. Therefore, Concept Descriptions
are a valuable feature to ship metadata and to ensure a common understanding
of the shipped AAS. The mapping itself is provided as RML TriplesMaps (see
Listing 1.2) and can be executed with the open-source tool RMLMapper⁵.


6 Reasoning 


RDF and RDFS already contain trivial entailment rule sets⁶. As RDF and RDFS
are very general vocabularies, the allowed reasoning focuses on the syntactic 
position (subject, predicate, object) of entities in RDF graphs. For instance, the 
information that p is an instance of the class Property can be inferred from 
the fact that a triple with p at the predicate position exists. Although rule 


4 Full example: https://github.com/i40-Tools/RAMIOntology/tree/master/rml_mapping/mapping_examples.

5 Accessible at https://github.com/RMLio/rmlmapper-java.

6 https://www.w3.org/TR/rdf11-mt/.
entailments of this kind are certainly correct, the amount of explicit
data created increases significantly while the information content stays nearly the same.

In order to illustrate the power of reasoning based on the SAAS, selected
rule sets using owl:sameAs and rdfs:subClassOf properties have been prepared.
The rules are encoded in N3 according to Stadtmüller et al. in order to use
their Linked Data Integration and Reasoning Engine [10]. In addition to the two
entailment regimes, both consisting of several single rules⁷, the SAAS ontology
with its inherent axioms is integrated on the fly. Section 9.3 presents the results.
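The effect of such rules can be illustrated with a small forward-chaining loop. The sketch covers rdfs9, rdfs11 and the symmetry of owl:sameAs only; it is plain Python rather than N3, it does not use the engine from [10], and the class names are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
TYPE, SUBCLASS, SAME_AS = "rdf:type", "rdfs:subClassOf", "owl:sameAs"


def infer(triples: Set[Triple]) -> Set[Triple]:
    """Apply rdfs9, rdfs11 and owl:sameAs symmetry until nothing new is derived."""
    closure = set(triples)
    while True:
        new: Set[Triple] = set()
        for s, p, o in closure:
            if p == SAME_AS:
                new.add((o, SAME_AS, s))            # symmetry of owl:sameAs
            if p == SUBCLASS:
                for s2, p2, o2 in closure:
                    if p2 == SUBCLASS and s2 == o:
                        new.add((s, SUBCLASS, o2))  # rdfs11: subclass transitivity
                    if p2 == TYPE and o2 == s:
                        new.add((s2, TYPE, o))      # rdfs9: type inheritance
        if new <= closure:
            return closure
        closure |= new


facts: Set[Triple] = {
    ("ex:Submodel", SUBCLASS, "ex:Identifiable"),
    ("ex:Identifiable", SUBCLASS, "ex:Referable"),
    ("ex:sm1", TYPE, "ex:Submodel"),
    ("ex:sm1", SAME_AS, "ex:sm1_copy"),
}
closure = infer(facts)
derived = closure - facts
```

Four input facts yield four derived ones here, which illustrates how quickly the explicit data grows once entailment rules are applied.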


7 Schema Validation 


The AAS presents a closed-world model. As such, the definitions of classes and
properties must be regarded as restrictions: simply reusing properties that
were introduced for class A on class B usually violates the model.
RDF, on the other hand, by default allows every pattern that is not explicitly
excluded. Nevertheless, industrial use cases require verifiable statements on
both the data content and its structure.

The Shapes Constraint Language (SHACL) [9] is a W3C recommendation
for validation mechanisms on RDF graphs. The definition of required
attributes, cardinality of relations or datatype restrictions in the form of shapes
is an important aspect to enable data quality assurance in any productive system.
Some tools already assist the creation of SHACL shapes,
e.g. a Protégé plugin and TopBraid Composer. As SHACL shapes
are also defined in RDF, they share the same format as the validated data, in
contrast to e.g. plain SPARQL Rules. This simplifies the required technology stack
and reduces the number of libraries used.

The SAAS supplies respective shapes for all its classes⁸. These shapes mainly
check for mandatory properties but also check the existence of label and
comment annotations. In addition, the shapes are essential in order to check the
incoming data during the exchange of Asset Administration Shells.
Furthermore, the shapes can also be used to describe input and output specifications.
For instance, an Industrie 4.0 component can postulate that its API requires 
data objects conforming to the Asset Shape and will output Submodel objects 
as defined by the Submodel Shape. 
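
The kind of check such a shape expresses can be sketched in a few lines of
Python. This is a deliberately simplified stand-in for a mandatory-property
constraint, not a SHACL processor and not the published SAAS shapes; the class
name ex:Asset, the property ex:kind and the list ASSET_SHAPE are assumptions
made for this example only.

```python
def validate_mandatory(graph, target_class, mandatory):
    """Return {node: [missing properties]} for all instances of target_class."""
    instances = {s for (s, p, o) in graph
                 if p == "rdf:type" and o == target_class}
    report = {}
    for node in sorted(instances):
        present = {p for (s, p, _) in graph if s == node}
        missing = [p for p in mandatory if p not in present]
        if missing:
            report[node] = missing
    return report


# Hypothetical shape: every ex:Asset needs a label, a comment and a kind.
ASSET_SHAPE = ["rdfs:label", "rdfs:comment", "ex:kind"]

graph = {
    ("ex:pi", "rdf:type", "ex:Asset"),
    ("ex:pi", "rdfs:label", '"Raspberry Pi 3B+"'),
    ("ex:pi", "ex:kind", "ex:Instance"),
    ("ex:ctrl", "rdf:type", "ex:Asset"),
    ("ex:ctrl", "rdfs:label", '"Controller"'),
    ("ex:ctrl", "rdfs:comment", '"Automation controller"'),
    ("ex:ctrl", "ex:kind", "ex:Instance"),
}
report = validate_mandatory(graph, "ex:Asset", ASSET_SHAPE)
print(report)
```

An empty report corresponds to a conforming data graph. A component that only
accepts objects conforming to the Asset Shape would reject ex:pi here because
its comment annotation is missing.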


8 Use Cases 


We use three different Asset Administration Shells in order to evaluate our
approach. All of them follow the specification in [1] and use the AASX
file format. The corresponding descriptions are included in XML files contained
in the AASX files.


⁷ rdfs9 and rdfs11 from [7], plus the transitivity, symmetry and replaceability
characteristics of owl:sameAs.
⁸ https://github.com/i40-Tools/RAMIOntology/tree/master/schema.


Raspberry Pi. The first Asset Administration Shell represents a Raspberry
Pi 3B+ (see Listing 1.1). Three Submodels are included: one for the
technical characteristics, one containing documentation material such as the product
sheet and a usage manual, and one describing the asset itself.
Here, the asset is one specific Raspberry Pi (kind=instance) and does not refer
to the product type covering all Raspberry Pis that have been or will
ever be produced (kind=type). Therefore, the description is valid for exactly
one Raspberry Pi. The AAS delivers 52 SubmodelElements.


Automation Controller. AAS2 describes an electronic controller for automa- 
tion facilities. As it is not approved as an official artifact, the providing company 
as well as its details can unfortunately not be published. AAS2 contains one asset, 
three submodels and more than 100 SubmodelElements. 


Multi-protocol Controller. The third use case (AAS3) represents an internet- 
capable controller unit with multiple protocol support. Like AAS2, this Asset 
Administration Shell is not officially published yet. However, none of the authors 
of this paper was involved in the creation of either AAS2 or AAS3. The third 
use case includes one Asset with eight Submodels and more than 150 Submod- 
elElements. 


9 Experimental Evaluation 


We evaluate the AAS to SAAS mapping by examining the results and the
performance of the three use cases (see Table 1). As a reference to estimate the
information coverage, the number of XML nodes of the AAS serializations is
provided. In addition, the number of unique leaves of the three XML trees is
noted, as these numbers better reflect the actual information content of the AAS.
Table 1 also presents the number of triples generated by the RMLMapper. The
comparison indicates, as already mentioned, that not the whole expressiveness
of the AAS can be transported to the SAAS version. This is due to the fact that
some constructs cannot be represented sufficiently in RDF (for instance the
Property class), but also because many original entities contain redundant information.
Especially the ConceptDescriptions repeat many attributes, which are collapsed
by the mapping process and only added once.


Table 1. Results of the SAAS mapping and RDF serialization.


              #XML Leaves /  AAS     #Triples  SAAS   SAAS     SAAS      SAAS
              #XML Nodes     (XML)             (XML)  (nquad)  (turtle)  (JSON-LD)
RaspberryPi   1161/2864      148 KB  510       40 KB  86 KB    32 KB     51 KB
AAS2          925/2604       91 KB   459       17 KB  58 KB    12 KB     20 KB
AAS3          2651/6743      313 KB  1154      43 KB  156 KB   31 KB     52 KB


9.1 Mapping Time


The necessary overhead in terms of computation time, measured in milliseconds, is
presented in Fig. 4, in addition to the average mapping times outlined in the last
column of Table 1. The time was measured on a regular laptop (Win10, 16 GB,
Intel i5-7300 2.60 GHz) using a bash emulation. The different RDF serializations
do not noticeably influence the execution time, indicating that the writing is not
the bottleneck. While the average mapping times of the Raspberry Pi AAS (2.7 s)
and AAS2 (3.1 s) are rather close, the duration for AAS3 (5.7 s) is significantly
higher. The variation between the selected use cases reflects the differences in
their XML file size. This could indicate that the overall behavior is nearly linear.
However, each of the 19 TripleMaps leads to a reloading and reiteration of the
whole XML file. Overcoming this expensive step would speed up the mapping
significantly but is out of scope for this paper.


[Figure 4: chart of the mapping time [ms] for RaspberryPi, AAS2 and AAS3, with
one series per serialization (nquads, turtle, json-ld).]


Fig. 4. Mapping times for the three Asset Administration Shells.


9.2 Data Overhead 


RDF is in general not an efficient data format in terms of storage.
Nevertheless, the syntax requirements of the AAS and especially its XML schema
already create significant overhead for the original AAS model. As depicted in
Table 1, all RDF serializations reduce the necessary storage size. Especially
noteworthy is the difference between the original XML file size and the RDF/XML
serialization. This is mostly due to the usage of namespaces in the RDF/XML
version, which shortens the written URIs. It should be mentioned that for all
serializations the mapping step (m) and the serialization (ser) were executed directly
by the mapping engine.

Overall, the resulting costs in terms of storage requirements and
communication bandwidth do not exceed the ones created by the original Asset
Administration Shells. Consequently, all devices and scenarios capable of handling AAS
are also sufficient for the operation of SAAS. Furthermore, the possible
serialization of SAAS as both XML and JSON should enable AAS implementations
to quickly adapt to SAAS objects in their original file format.


Table 2. Added triples by the different rule sets.


              Triples     sameAs     sameAs    subClassOf  subClassOf  both       both
              (original)  (triples)  (time)    (triples)   (time)      (triples)  (time)
RaspberryPi   510         959        2,760 ms  771         2,719 ms    1,217      2,808 ms
AAS2          459         452        3,057 ms  367         2,368 ms    570        2,313 ms
AAS3          1154        1,115      2,776 ms  818         2,677 ms    1,343      2,668 ms


9.3 Reasoning 


Three different rule sets have been applied to all use cases. All rule sets contain
a web request to the ontology source file in order to load the class hierarchy
and any other relevant axioms of the data model itself. The first one also adds
several rules reflecting the symmetry and transitivity of owl:sameAs as well as
the fact that same instances share all properties and annotations of each other.
The second rule set contains subclass statements as encoded by the rules rdfs9
and rdfs11 [7]. The third set combines both into the most expressive reasoning
set. Table 2 gives an overview of the number of created triples. For owl:sameAs,
rdfs:subClassOf and the combination of both entailments, it shows the number
of uniquely added triples and the average reasoning time.
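
To make the three rule sets concrete, the following Python sketch applies them
by naive forward chaining to a toy graph. It is an illustration only: the
actual rules are encoded in N3 and executed by Linked Data-Fu, the web request
for the ontology is omitted, and all ex: identifiers are hypothetical.

```python
SAME, SUB, TYPE = "owl:sameAs", "rdfs:subClassOf", "rdf:type"


def same_as_rules(g):
    """Symmetry, transitivity and property sharing for owl:sameAs."""
    out = set()
    for (s, p, o) in g:
        if p != SAME:
            continue
        out.add((o, SAME, s))                      # symmetry
        for (s2, p2, o2) in g:
            if s2 == o and p2 == SAME:
                out.add((s, SAME, o2))             # transitivity
            if s2 == s and p2 != SAME:
                out.add((o, p2, o2))               # same instances share properties
    return out


def subclass_rules(g):
    """rdfs9 (type propagation) and rdfs11 (subclass transitivity)."""
    out = set()
    for (s, p, o) in g:
        if p != SUB:
            continue
        for (s2, p2, o2) in g:
            if p2 == SUB and s2 == o:
                out.add((s, SUB, o2))              # rdfs11
            if p2 == TYPE and o2 == s:
                out.add((s2, TYPE, o))             # rdfs9
    return out


def closure(graph, *rule_sets):
    """Apply the rule sets repeatedly until no new triple is derived."""
    g = set(graph)
    while True:
        new = set().union(*(rules(g) for rules in rule_sets)) - g
        if not new:
            return g
        g |= new


graph = {
    ("ex:pi", TYPE, "ex:Controller"),              # hypothetical toy data
    ("ex:Controller", SUB, "ex:Device"),
    ("ex:Device", SUB, "ex:Asset"),
    ("ex:pi", SAME, "ex:pi-alias"),
}
added_same = len(closure(graph, same_as_rules)) - len(graph)
added_sub = len(closure(graph, subclass_rules)) - len(graph)
added_both = len(closure(graph, same_as_rules, subclass_rules)) - len(graph)
print(added_same, added_sub, added_both)
```

Even in this small graph the combined rule set adds more triples than the two
individual sets together, because types derived through the class hierarchy are
also copied to the owl:sameAs alias.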

We use the Linked Data-Fu engine [10]. The preparation of the reasoning
engine, involving the parsing of the rule files, takes around 1 s. The subsequent
web request, the download of the ontology, the evaluation of the rules and the
serialization to an N-Triples file are then executed. The duration distribution of ten
repetitions is shown in Fig. 5. One can see that the whole process takes between
2.3 and 3.3 s, nearly independently of the amount of input (the graph for AAS3 is
significantly larger than the graph for the Raspberry Pi) and the expressiveness of
the rule sets (the second set leads to far fewer results than the others).

As the rule sets only regard the structure of the ontology, the
inferencing of context-dependent knowledge is not yet possible. In order to reach
productively usable information, domain-specific axioms tailored to the
actually contained or expected data are necessary. However, we can show that the
reasoning process with complex rules is applicable in an acceptable amount of
time.


9.4 Schema Validation 


The evaluation times of the SHACL shapes are shown in Fig. 6. On average, the
execution of all shapes takes 46.2 s and the execution of one single shape 1.8 s.
All shapes have been executed a total of ten times.

About 2 s are required for setting up the validation tool and parsing the data
graph (the Asset Administration Shell) and the single class shape. The size of
the Asset Administration Shell has no significant impact on the achieved results.
Given these conditions, we claim that the necessary effort is acceptable for
a typical Industrie 4.0 scenario, as the validation itself is not necessary on every
restricted device. The validation of data takes place either at development or
at deployment time, where time is not critical. In addition,
the validation is important for the higher-level data analytics services, which
usually run on more powerful machines or are even hosted in the cloud.


Fig. 5. SAAS Reasoning duration.    Fig. 6. Schema validation performance.


10 Conclusion and Outlook 


We presented a semantic version of the Asset Administration Shell, a mapping
from its XML serialization to any RDF serialization, schema validation shapes
and a brief set of reasoning rules. In that sense, we showed the lifting process of
the AAS data to a semantic integration layer.

This is one step towards an automated integration of Industrie 4.0 components.
We showed how existing, non-customized tools can work with the RDF model
of the AAS and execute their task without prior configuration. This enables the
implementation of truly interoperable pipelines and data-driven workflows, not
only on the data format and syntax level but also regarding the meaning of the
data. Furthermore, we examined the overhead of the SAAS model and showed that
its requirements do not exceed those of the original AAS model.

The mapping provided in this paper outlines the data lifting to the SAAS 
RDF model. The lowering of RDF to the original AAS data model has not yet 
been achieved. Furthermore, the main benefit of the semantic model is, besides 
its formalized meaning, the interlinking with other definitions and the integration 
of additional sources. 

For now, only the data provisioning capabilities of the AAS are defined.
In the next step, the provisioning and invocation of operations through Asset
Administration Shells will be specified. Using semantically defined descriptions
of the respective interfaces, their input and output parameters and the provided
services will allow the Industrie 4.0 community to rely on the extensive
expertise and experience with Web Services, and Semantic Web Services in
particular. This way, the goal of truly interoperable and flexible manufacturing
workflows, where software and hardware, materials and products, customers and
suppliers form on-demand information chains, benefits from the large body of
existing research in the area.

We will further extend our work in order to keep the semantic models aligned
with the progress of the Asset Administration Shell specification. Furthermore,
we provide feedback and outline established best practices to the manufacturing
community. We also see two main challenges which must be tackled by the semantic
community. First, the core potential of the Semantic Web, namely the seamless
integration of heterogeneous devices, services and data sources, still lacks sufficient
numbers of implemented use cases and deployed scenarios in practice. Second,
the recurring discussion on identifiers in distributed settings is a huge opportunity
for the established practices of the Semantic Web and Linked Data in particular.
However, the benefits of (dereferenceable) URIs are still underestimated in the
manufacturing community, mostly because of a lack of experience.


Abstract. Customer service agents play an important role in bridging
the gap between customers’ vocabulary and business terms. In a scenario
where organisations are moving into semi-automatic customer service,
semantic technologies with the capacity to bridge this gap become a necessity.
In this paper we explore the use of automatic taxonomy extraction
from text as a means to reconstruct a customer-agent taxonomic vocabulary.
We evaluate our proposed solution in an industry use case scenario
in the financial domain and show that our approaches for automated term
extraction and using in-domain training for taxonomy construction can
improve the quality of automatically constructed taxonomic knowledge
bases.


Keywords: Taxonomy extraction · Knowledge base construction ·
Conversational agents


1 Introduction 


Customer service agents are charged with the role of identifying the intention
behind customer questions, retrieving the relevant business information that
addresses those queries, and expressing it in a form that the customer is able
to understand. However, the increasing demand for customer services and the
necessity to scale up a human workforce are an ongoing challenge for businesses.
To address that, organisations are turning to semi-automatic customer service
through the use of digital conversational agents (DCAs, also referred to as
chatbots) for primary contact with customers, leaving human agents to deal mostly
with unusual or more complex customer queries.

Two of the challenges faced by DCAs are the acquisition of domain knowledge
and a knowledge representation that can be audited by the business. Whereas
human agents may pass through training programmes to understand the business
they are working with, DCAs not only need to acquire knowledge about the
business but also about the way customers express their informational needs.
Further, while it is expected that human agents would be responsible for the
provision of the correct information needed by the customer, DCAs cannot be
held legally responsible. Instead, businesses need to have means to audit the
knowledge used by these DCAs in order to minimise the error in the retrieval
of information. Therefore, DCAs need to make use of knowledge representation
mechanisms that are applicable to their tasks but also interpretable by humans.

© The Author(s) 2019

In this paper we propose a solution for the automatic generation of
taxonomies from customer service dialogue data. Such a solution contributes to the
conversational agent use case by taking advantage of existing dialogues between
customers and agents as learning data about the domain, and by automatically
generating taxonomies as semantic structures for auditing the knowledge used
by the conversational agent. The main contributions of this paper are: the
comparison of state-of-the-art Term Extraction features, the proposal of Taxonomy
Likelihood Scores to measure how likely a tree represents an actual taxonomy,
and the applicability of our solution to an anonymised financial customer service
use case provided by Fidelity Investments.


2 Related Work 


DCAs need to have domain-specific knowledge in order to be effective interfaces 
for human-computer interaction. Most attempts to represent such knowledge 
have been based on extracting information directly from textual sources, such as 
online discussion forums [19]. However, it has been identified that a structured
form of knowledge can provide a useful intermediate step. Sánchez-Díaz et al. [26]
used a logical representation to apply chatbots as intelligent tutors, whereas
Semantic Web representations have been used in OntBot [3], through manually 
designed ontologies, and DBpedia bot [7], by using an open domain knowledge 
graph. However, none of these solutions is based on automatically generated 
domain-specific taxonomies. (See Abdul-Kader et al. [1] for a survey.) 

The approach we have defined in this paper consists of two steps. First we
extract the terms that are most relevant to the domain, a task referred to as
automatic term recognition (ATR). Current approaches to this task have employed a
varied suite of methods for extracting terms from text based on parts of speech
and metrics for assessing ‘termhood’ [15,29], domain modelling [11], and the
composition of multiple metrics in an unsupervised manner [5]. More recently,
these methods have been combined into off-the-shelf tools such as ATR4S [7]
and JATE [31], and our system is a similar implementation to ATR4S.

The second step organizes these terms into a taxonomy. Although similar to
hypernym learning [28], the challenges it poses are quite different (see [10]).
Multiple string and grammar-based methods have been proposed, where baseline
systems have used string-based metrics with Hearst-like patterns learned from
text [23], while more advanced ones have been based on the concept of
endocentricity of terms to indicate a hypernym-like relationship [30]. Other methods not
based on grammar such as genetic algorithms [14] or word embeddings [17, 27]
have also been explored. We include these approaches in our system, thus
providing an integrated solution to both term and taxonomy extraction.


3 Task Definition 


Taxonomy extraction consists of extracting a set of terms from a corpus and
organising them into a taxonomic structure [10], i.e. a hierarchical tree structure
where terms closer to the root of the tree are considered broader¹ than terms
farther from the root.

Our framework for taxonomy extraction from text is divided into two steps: 
automatic term recognition and taxonomy construction. The term recognition 
step aims at detecting the most relevant terms (also referred to as keyphrases) 
for a domain represented by a corpus. Based on this list of terms, the taxonomy 
construction step aims at finding a suitable taxonomy that maintains the correct 
broad/narrow relationship between terms. 


Definition 1. Given a set of terms T in which each t ∈ T is represented by a
label t_l. A taxonomy, 𝒯 = (T, ⊑), is then defined as a partial ordering of a set
of terms, T, satisfying the following constraints:


— (Reflexivity) t ⊑ t ∀t ∈ T

— (Antisymmetry) t1 ⊑ t2 and t2 ⊑ t1 if and only if t1 = t2 ∀t1, t2 ∈ T

— (Transitivity) t1 ⊑ t2 and t2 ⊑ t3 implies t1 ⊑ t3 ∀t1, t2, t3 ∈ T

— (Unique parent) if t1 ⊑ t2 and t1 ⊑ t3 then t2 ⊑ t3 or t3 ⊑ t2 ∀t1, t2, t3 ∈ T

— (Single root) There is some element r ∈ T such that t ⊑ r ∀t ∈ T

4 Automatic Term Recognition


Our proposed solution for term recognition uses a corpus-based approach based
on a pipeline of four consecutive steps (Fig. 1): (i) identification of candidate
terms, (ii) scoring, (iii) ranking, and (iv) filtering.


[Figure 1: pipeline diagram. Text Dataset → Identification of Candidate Terms →
Scoring → Ranking → Filtering → List of Terms.]


Fig. 1. Automatic term recognition pipeline.


¹ By ‘broader’ we do not mean hypernymy, but that the topic has a wider scope.


The identification of candidate terms consists of identifying the key
terms appearing within each document. This is accomplished by detecting all
noun phrases appearing in the corpus that:


— Contain at least a minimum and at most a maximum number of words (n-gram length).

— Neither start nor end with a stopword [20].

— Follow a set of part-of-speech patterns empirically found to be associated
with terms [9,20].

— Occur within the corpus at least a given number of times.
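
Three of the four filters above (length, stopwords, minimum frequency) can be
sketched without any linguistic tooling. The following Python example is a
simplification and not the system described here: it omits noun-phrase
detection and the part-of-speech patterns entirely, and the stopword list,
parameter names and sample sentences are assumptions.

```python
from collections import Counter

# Toy stopword list (assumption); a real system would use a full list.
STOPWORDS = {"a", "an", "the", "of", "to", "my", "is", "in", "and", "i"}


def candidate_terms(docs, min_n=1, max_n=3, min_freq=2, stopwords=STOPWORDS):
    """Count word n-grams that pass the length, stopword and frequency filters."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Reject n-grams that start or end with a stopword.
                if gram[0] in stopwords or gram[-1] in stopwords:
                    continue
                counts[" ".join(gram)] += 1
    return {term: c for term, c in counts.items() if c >= min_freq}


docs = [
    "i want to transfer my mutual funds",
    "the transfer of mutual funds is pending",
]
candidates = candidate_terms(docs)
print(candidates)
```

Without the part-of-speech filter, the sketch keeps ‘transfer’ even where it is
used as a verb, which is exactly the kind of candidate the pattern-based filter
is meant to remove.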


The scoring step provides a quantitative measurement for the relevance of 
each candidate term to the domain in the corpus. As the notion of relevance 
changes from one application area to another, the scoring step can make use of 
one or multiple scoring functions more suitable for the underlying domain and 
task. In this work we will explore the use of multiple combinations of scoring 
functions for the choice of terms in the financial customer service domain. 

Formally, consider C a corpus and t ∈ C a candidate term extracted in the
step for identification of candidate terms. The score for a given term t is an
n-tuple score(t) = (f_1(t), f_2(t), ..., f_n(t)) given by a set of functions that indicate
the relevance of t in C for a task T.

In this work we evaluate scoring functions within four categories: 


— Frequency of occurrences: scoring functions that consider only frequencies of 
candidate terms within the corpus and/or frequency of words occurring within 
candidate terms (TF-IDF, Residual IDF [13], C Value [4], ComboBasic [6]). 

— Context of occurrences: scoring functions that follow the distributional 
hypothesis [18] to distinguish terms from non-terms by considering the dis- 
tribution of words in their contexts (PostRankDC [11]). 

— Reference corpora: scoring functions based on the assumption that terms can 
be distinguished from other words and collocations by comparing occurrence 
statistics in the dataset against statistics from a reference corpus - usually of 
general language/non specific domain (Weirdness [2], Relevance [24]). 

— Topic modelling: scoring functions based on the idea that topic modelling 
uncovers semantic information useful for term recognition, in particular that 
the distribution of words over the topics found by the topic modelling is a 
less noisy signal than the simple frequency of occurrences (Novel Topic Model 
(NTM) [21]). 


The ranking step sorts all candidate terms from the most relevant (i.e.
highest score value) to the least relevant (i.e. lowest score value). However, depending
on the number of scoring functions used, the ranking mechanism will be different:


— Single score: where only one score function is used, all terms are sorted in
descending order of their associated score value.

— Voting: when more than one scoring function is used, the ranking is based on
the voting mechanism from [32] and happens in two steps. In the first step,
the single score procedure is applied to each scoring function used, resulting
in a set of ranked lists R, one list per scoring function. Next, the final ranking
position for a candidate term t is given by Eq. 1, where n is the number of
scoring functions used and R_i(t) is the rank position of t as provided by the
scoring function i.


rank(t) = \sum_{i=1}^{n} \frac{1}{R_i(t)}    (1)


Last, the filtering step keeps only the top n terms after the ranking step 
(where n is a parameter provided to the algorithm). 
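
The ranking and filtering steps can be sketched as follows. The example
assumes that the voting formula sums the reciprocal rank positions over all
scoring functions, which is one reading of Eq. 1; the score values, the tie
break by term name and the function names are hypothetical.

```python
def rank_positions(scores):
    """Map each term to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=lambda t: (-scores[t], t))
    return {t: i + 1 for i, t in enumerate(ordered)}


def vote(score_tables):
    """Combine several scoring functions via summed reciprocal ranks
    (assumed form of the voting formula)."""
    ranks = [rank_positions(s) for s in score_tables]
    terms = set(score_tables[0])
    return {t: sum(1.0 / r[t] for r in ranks) for t in terms}


def top_n(votes, n):
    """Filtering step: keep the n best terms (ties broken alphabetically)."""
    return sorted(votes, key=lambda t: (-votes[t], t))[:n]


# Invented scores from two scoring functions for three candidate terms.
tfidf = {"mutual funds": 0.9, "account": 0.7, "hello": 0.2}
weirdness = {"mutual funds": 3.1, "account": 4.0, "hello": 0.5}

votes = vote([tfidf, weirdness])
print(top_n(votes, 2))
```

Because only rank positions enter the vote, scoring functions with very
different value ranges can be combined without normalisation.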


5 Taxonomy Construction 


Taxonomy construction aims to build a taxonomy based on the terms extracted
by the automatic term recognition algorithm. The proposed pipeline consists
of two consecutive steps (Fig. 2): (i) pairwise scoring, and (ii) search. In the
first step, each pair of terms extracted {∀(c, d) ∈ T | c ≠ d} will receive a score
referring to the estimated probability p(c ⊑ d) that c is a narrower term than d.
This score will be based on the terms themselves and on their corpus frequency.
Based on this set of scores, the second step will search for a tree structure
that maximises the likelihood of it being a taxonomy, according to a pre-defined
taxonomy likelihood score. The result of this process is a taxonomy T containing
all the terms provided as input.


[Figure 2: pipeline diagram; recoverable labels: Text Dataset, List of Terms,
Pairwise (scoring).]


Fig. 2. Taxonomy construction pipeline.


5.1 Pairwise Scoring 


Pairwise scoring aims at identifying, for a pair of terms (c, d) where c ≠ d, if c is
a narrower concept than d (c ⊑ d). In this work we accomplished this by using
a supervised learning setting.

For each pair of terms (c, d) a feature vector is created with features from
the following four categories:


— String-based features: Features in this category presume that shorter terms
embedded within longer ones are more general. For instance, ‘funds’ is more
general than ‘mutual funds’. Features in this category are: inclusion, overlap,
and longest common substring (LCS). Inclusion is +1 if c is totally contained
within d, −1 if d is totally contained within c, or 0 otherwise. Overlap
represents how many words are shared between two terms. Last, LCS measures
the longest string of characters shared by two terms.

— Frequency-based features: This category assumes that the generality or
specificity of a term influences its frequency in the domain corpus. Features in this
category are: relative frequency and term difference. Relative frequency
measures the difference between the frequency of two terms (Eq. 2, where f(x)
denotes the frequency of term x). Term difference measures the difference
in the distribution of two terms (Eq. 3, where D(x) denotes the set of
documents in which the term x appears).


relativeFrequency(c, d) = \log\left(\frac{f(d)}{f(c)}\right)    (2)

termDifference(c, d) = \frac{|D(c) \cap D(d)|}{|D(d)|}    (3)


— Word embedding features: Features in this category intend to capture the
generality of a term by the relation it has with other terms in a shared latent
semantic space. For that, GloVe [25] vectors are gathered for each word within
a term c in order to generate a vector v_c for the whole term using Singular Value
Decomposition (SVD) [27]. Two approaches are used, each one leading to a
different word embedding feature: SVD average and SVD minmax. The word
embedding (WE) features used for pairwise scoring between a pair (c, d) will
be calculated according to Eq. 4.


WE(c, d) = v_c^{\top} A v_d    (4)


— Lexical features: Features in this category take advantage of existing lexical
databases (e.g. WordNet [22]) with information on the generality of terms
(i.e. c ⊑ d). Features available are: complete hypernyms (CH) and word
hypernyms (WH). The CH feature measures if both terms appear related
(directly or indirectly) within the background database, whereas the WH
feature measures the presence of relations between any two pairs of words in
terms c and d according to Eq. 5.


WH(c, d) = \frac{2}{\#words(t_c) + \#words(t_d)} \sum_{w_c \in t_c,\, w_d \in t_d} CH(w_c, w_d)    (5)

An SVM (Support Vector Machine) classifier² is then trained using a manually
created taxonomy with an associated corpus in the same domain. Each pair of
terms (c, d) where c ≠ d and c is a child of d in the training taxonomy is labelled
with +1, otherwise the pair is labelled as −1. The result of the classification is
the probability estimation for the class +1, which is then given as the estimate
for p(c ⊑ d).
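
The string-based and frequency-based features can be computed with a few lines
of Python. This sketch is not the original implementation: containment is
tested on whole words, relative frequency is read as log(f(d)/f(c)) and term
difference as |D(c) ∩ D(d)| / |D(d)|, which are interpretations of Eqs. 2 and 3,
and the frequencies and document sets are invented.

```python
import math


def inclusion(c, d):
    """+1 if c is contained in d, -1 if d is contained in c, else 0
    (containment tested on whole words, an assumption)."""
    if f" {c} " in f" {d} ":
        return 1
    if f" {d} " in f" {c} ":
        return -1
    return 0


def overlap(c, d):
    """Number of words shared by the two terms."""
    return len(set(c.split()) & set(d.split()))


def lcs_length(c, d):
    """Length of the longest common substring (in characters)."""
    best = 0
    prev = [0] * (len(d) + 1)
    for i in range(1, len(c) + 1):
        cur = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            if c[i - 1] == d[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def relative_frequency(c, d, freq):
    return math.log(freq[d] / freq[c])


def term_difference(c, d, docs):
    return len(docs[c] & docs[d]) / len(docs[d])


c, d = "mutual funds", "funds"
freq = {"funds": 40, "mutual funds": 10}               # invented counts
docs = {"funds": {1, 2, 3, 4}, "mutual funds": {2, 3}}  # invented document ids
features = [inclusion(c, d), overlap(c, d), lcs_length(c, d),
            relative_frequency(c, d, freq), term_difference(c, d, docs)]
print(features)
```

A vector of this kind, extended by the embedding and lexical features, is what
a classifier would receive for each ordered pair (c, d).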


5.2 Search 


Based on the pairwise score between any two terms provided as input to the
taxonomy construction, the search step aims at identifying a tree structure that
represents a taxonomy of these terms. In order to identify how close a tree
structure is to a valid taxonomy, a taxonomy likelihood score was designed. In
this work we design and evaluate three score functions: transitive, non-transitive,
and Bhattacharyya-Poisson. The goal of the search is to find a tree structure
that maximises the taxonomy likelihood score. Here we also experimented with
two types of search methods: greedy and beam.


² Earlier versions of the system experimented with other classifiers; however, we have
found that SVMs provide higher quality and more robust performance.


Taxonomy Likelihood Score 


Transitive. The transitive score (Eq. 6) just follows the basic assumption that
the best taxonomy is the one that maximises the product of all p(c ⊑ d) for all
(c, d) pairs of terms. In practice, we take logs of the probabilities and maximise
the resulting sum (Eq. 7).


S(T) = \prod_{c \sqsubseteq d} p(c \sqsubseteq d)    (6)


\max_T S(T) = \max_T \sum_{c \sqsubseteq d} \log p(c \sqsubseteq d)    (7)


Non-transitive. In practice, the transitive score is not expected to work well
since it is maximised by taxonomies for which there are as few pairs
of terms (c, d) as possible such that c ⊑ d. The most trivial case is when a taxonomy is
composed only of a single root term and all other terms are direct children of it.
As such, taxonomies that are constructed by maximising the transitive score
tend to have a very large average number of children. In order to avoid that, the
non-transitive score (Eq. 8) considers only the set of direct children, which are
denoted by c < d and should satisfy the following constraints:


— c < d implies c ⊑ d.
— c ⊑ d implies there exists e such that c < e and e ⊑ d.
— For all c < d there does not exist e, e ≠ d, e ≠ c, such that c < e and e < d.


S_{nonTransitive}(T) = \prod_{c < d} p(c \sqsubseteq d)    (8)
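
The difference between the two scores can be seen on a three-term example. The
Python sketch below reads the transitive score as a product over all
ancestor pairs (c ≠ d) and the non-transitive score as a product over direct
child-parent pairs, both computed in log space; the probabilities and the
function names are invented for this illustration.

```python
import math


def ancestors(parent, t):
    """Yield all proper ancestors of t in a child -> parent map."""
    while parent[t] is not None:
        t = parent[t]
        yield t


def transitive_log_score(parent, p):
    """Sum of log p(c <= d) over all ancestor pairs (log of the transitive score)."""
    return sum(math.log(p[(c, d)])
               for c in parent for d in ancestors(parent, c))


def non_transitive_log_score(parent, p):
    """Sum of log p(c <= d) over direct child-parent pairs only."""
    return sum(math.log(p[(c, d)])
               for c, d in parent.items() if d is not None)


# Invented pairwise probabilities p(c is narrower than d).
p = {("b", "a"): 0.9, ("c", "a"): 0.6, ("c", "b"): 0.9}

flat = {"a": None, "b": "a", "c": "a"}    # both terms directly under the root
chain = {"a": None, "b": "a", "c": "b"}   # c under b under a

print(transitive_log_score(flat, p), transitive_log_score(chain, p))
print(non_transitive_log_score(flat, p), non_transitive_log_score(chain, p))
```

With these numbers the transitive score prefers the flat tree, because the
chain pays for the additional pair (c, a), whereas the non-transitive score
prefers the chain, whose two direct edges are both highly probable.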


Bhattacharyya-Poisson. Despite the possible improvement given by the
non-transitive likelihood function, it may still lead to a single path (i.e. a tree with
just one branch), which differs from the usual expectation that taxonomies are more
balanced trees (i.e. tree structures with multiple branches). In order to address
this, the Bhattacharyya-Poisson likelihood score takes into account the number
of children of each node in the tree.

Formally, let n_c denote the number of terms in a taxonomy that have exactly
c children. If the tree was to be constructed in a truly random fashion we would
expect n_c to be distributed according to a binomial distribution (Eq. 9). However,
in a completely balanced tree of N terms there are N − 1 direct children so that
the number of children in each branch is p = (N − 1)/b, where b is the number of
branches. In order to allow us to vary the average number of children in each
branch (λ), since taxonomies do not need to be completely balanced, we use the
Poisson distribution as an approximation for the binomial distribution (Eq. 10).
However, the constraints on what constitutes a taxonomy fix this value to very
near 1 and we wish to vary this in order to control the taxonomy structure. We
thus ignore the leaf nodes in the taxonomy (i.e., we ignore n_0).


p(n_c = m) = \binom{N}{m} p^m (1 - p)^{N-m}    (9)

p(n_c = m) \approx \frac{\lambda^m e^{-\lambda}}{m!}    (10)

In order to measure the similarity of the candidate taxonomy with this the- 
oretical probability value we use the Bhattacharyya distance [8] that measures 
the similarity of two (discrete) probability distributions p and q as provided in 
Eq. 11. If we compare this to the actual children count in a taxonomy, we can 
score a taxonomy as provided in Eq. 12. Finally, this is interpolated with the 
previous metric to score a taxonomy as provided in Eq. 13. 


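$$B(p, q) = \sum_i \sqrt{p_i q_i} \qquad (11)$$

$$BP(T) = \sum_{i=1}^{N} \sqrt{\frac{n_i}{N - n_0} \cdot \frac{\lambda^i e^{-\lambda}}{i!}} \qquad (12)$$

$$S_{BP}(T) = \prod_{c < d} p(c \sqsubseteq d) + \alpha N \times BP(T) \qquad (13)$$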
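
As an illustration of the Bhattacharyya-Poisson term, the following is a minimal Python sketch. It assumes the term is the Bhattacharyya coefficient between the observed distribution of child counts over non-leaf nodes and a Poisson distribution with mean λ; the function name `bp_score` and the list-of-child-counts input are hypothetical choices made for this example only.

```python
import math
from collections import Counter


def bp_score(children_counts, lam):
    """Compare observed child counts with a Poisson(lam) distribution.

    children_counts: one integer per term, giving its number of direct children.
    Leaf nodes (zero children) are ignored, as described in the text.
    """
    # n_i for i >= 1: how many terms have exactly i children
    counts = Counter(c for c in children_counts if c > 0)
    non_leaf = sum(counts.values())  # corresponds to N - n_0
    if non_leaf == 0:
        return 0.0
    score = 0.0
    for i, n_i in counts.items():
        observed = n_i / non_leaf
        expected = lam ** i * math.exp(-lam) / math.factorial(i)
        score += math.sqrt(observed * expected)
    return score


chain = [1, 1, 1, 0]      # a single path of four terms
balanced = [3, 0, 0, 0]   # one root with three children
print(bp_score(chain, 3.0), bp_score(balanced, 3.0))
```

With λ = 3, the tree with one three-way branch scores higher than the single path, which is the kind of structure this score is meant to favour.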


Search Method. Having chosen the likelihood score, the next step is to use a 
search strategy that optimises this likelihood score. We used two search strate- 
gies: (i) greedy, and (ii) beam. 


Greedy. In this method, given the pairwise scores for all pairs of terms, the
pair that has the maximal score for p(c ⊑ d) is added as c < d. This process is
repeated until a full taxonomy is constructed: we take pairs c < d that satisfy
the first four axioms of the taxonomy (reflexivity through unique parent) until
the final axiom (single root) is satisfied, which means the taxonomy is
complete.
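
The greedy procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the pairwise scores are given as a dictionary keyed by (child, parent), and it enforces only the unique-parent and no-cycle constraints before stopping once a single root remains. The names `greedy_taxonomy` and `creates_cycle` are hypothetical.

```python
def greedy_taxonomy(terms, score):
    """Build a child -> parent map by repeatedly taking the best valid pair.

    score: dict mapping (child, parent) to the pairwise probability.
    """
    parent = {}

    def creates_cycle(child, candidate_parent):
        # Walk up from the candidate parent; reaching `child` means a cycle.
        node = candidate_parent
        while node is not None:
            if node == child:
                return True
            node = parent.get(node)
        return False

    # A pair that is invalid once stays invalid, so a single pass over the
    # pairs in descending score order is equivalent to repeated arg-max.
    for (c, d), _ in sorted(score.items(), key=lambda kv: kv[1], reverse=True):
        if c == d or c in parent or creates_cycle(c, d):
            continue
        parent[c] = d
        if len(parent) == len(terms) - 1:  # single root left: complete
            break
    return parent


scores = {
    ("apple", "fruit"): 0.90,
    ("fruit", "apple"): 0.85,
    ("fruit", "food"): 0.80,
    ("apple", "food"): 0.70,
    ("food", "apple"): 0.60,
}
print(greedy_taxonomy(["food", "fruit", "apple"], scores))
```

In this toy input the second-best pair is rejected because it would create a cycle, so the result links apple to fruit and fruit to food.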


Beam. In contrast to the greedy method, where only a single partially con- 
structed taxonomy is maintained, this method keeps a list of the top scoring 
possible solutions. This list (also called beam) is of a fixed size, thus the addi- 
tion of a new partial solution may cause the least scoring partial solution to be 
dropped. Complete solutions are stored in a separate beam and the algorithm 
proceeds until it has considered all possible pairs of concepts and returns the 
highest scoring complete solution found. 
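
A compact sketch of the beam strategy is given below. It makes several assumptions that are not stated in the text: partial solutions are ranked by the sum of log-probabilities of their links (the non-transitive score), pairs are considered in descending score order, and all complete solutions are kept rather than a fixed-size second beam. The function names are hypothetical.

```python
import math


def _creates_cycle(parent, child, candidate_parent):
    # Walk up from the candidate parent; reaching `child` means a cycle.
    node = candidate_parent
    while node is not None:
        if node == child:
            return True
        node = parent.get(node)
    return False


def beam_taxonomy(terms, score, beam_width=3):
    """Return the best complete child -> parent map found, or None."""
    pairs = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    beam = [(0.0, {})]  # (log-likelihood, partial child -> parent map)
    complete = []
    for (c, d), p in pairs:
        candidates = list(beam)  # each partial solution may skip this pair
        for logp, parent in beam:
            if p <= 0 or c == d or c in parent or _creates_cycle(parent, c, d):
                continue
            extended = dict(parent)
            extended[c] = d
            item = (logp + math.log(p), extended)
            if len(extended) == len(terms) - 1:
                complete.append(item)    # single root left: store separately
            else:
                candidates.append(item)
        # Keep only the top-scoring partial solutions.
        beam = sorted(candidates, key=lambda s: s[0], reverse=True)[:beam_width]
    return max(complete, key=lambda s: s[0])[1] if complete else None


pair_scores = {
    ("apple", "fruit"): 0.90,
    ("fruit", "apple"): 0.85,
    ("fruit", "food"): 0.80,
    ("apple", "food"): 0.70,
    ("food", "apple"): 0.60,
}
print(beam_taxonomy(["food", "fruit", "apple"], pair_scores))
```

Unlike the greedy method, several alternative partial taxonomies survive each step, so a lower-scoring early link can still end up in the returned solution if it leads to a better complete tree.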


6 Experiments


6.1 Automatic Term Recognition 


Our experiments were performed on an anonymised customer service chatlog
dataset containing 300,000 conversations provided by Fidelity Investments,
where any personal information was removed prior to using the data for this
project. To use such a dataset, we first need to decide on the unit of
analysis or, in other words, what is considered a document in the corpus.
There are two obvious options: each interaction made by a customer or an agent,
or the whole customer-agent conversation. In our experiments we decided to use
the whole conversation as the unit of analysis for two reasons: (i) to give
priority to terms frequent across conversations rather than those cited
multiple times in only a single conversation; and (ii) because different
elements in a conversation could provide contextual information for the terms
extracted.

Also, in order to protect customers’ identities and personal data, the corpus 
provided is full of anonymisation tokens, i.e. tokens that were put in place of 
what would be sensitive information (e.g. name, email address, etc.). Before 
conducting any experiment, a list of stopword terms was compiled containing 
all anonymisation tokens appearing in the corpus so that these would not be 
captured as terms due to their potential high frequency in the corpus. 

After preprocessing the corpus, several experiments were conducted in order
to identify the most suitable configuration for automatic term recognition in
the customer-agent interaction scenario. First, we adjusted the
hyper-parameters for the identification of candidate terms: the part-of-speech
patterns used are those given in the literature [9,20]; the list of stopwords
includes common English stopwords and the anonymisation tokens extracted; a
relevant term should have a frequency higher than 2; and a term should not have
unlimited length, so we chose a maximum of 4-grams and varied the minimum
n-gram length between 1 and 2. Table 1 summarises the configuration of each
experiment.

For scoring functions we chose TF-IDF as a baseline due to its common use
in measuring relevant terms in a corpus. We also opted to have settings with
one function from each category to measure how they behave independently,
except for the reference corpora category, since using Wikipedia as a
background corpus could give a high number of false positive terms if used
alone. We also included a configuration (TE1 and TE1 long) that has
demonstrated positive results in previous experiments. Lastly, due to the
positive results of ComboBasic (TE2 long), we experimented with combining it
with a reference corpora scoring function (TE3 long).


Table 1. Configuration of each automatic term recognition experiment


Experiment       | n-gram    | Scoring function                                                                           | Ranking
                 | Min | Max | Frequency                                 | Context    | Reference corpora | Topic modelling |
TE baseline      | 1   | 4   | tf-idf                                    | -          | -                 | -               | Single
TE baseline long | 2   | 4   | tf-idf                                    | -          | -                 | -               | Single
TE1              | 1   | 4   | tf-idf, Residual Idf, C Value, ComboBasic | -          | Weirdness         | -               | Voting
TE1 long         | 2   | 4   | tf-idf, Residual Idf, C Value, ComboBasic | -          | Weirdness         | -               | Voting
TE2              | 1   | 4   | ComboBasic                                | -          | -                 | -               | Single
TE2 long         | 2   | 4   | ComboBasic                                | -          | -                 | -               | Single
TE3 long         | 2   | 4   | ComboBasic                                | -          | Relevance         | -               | Voting
TE4              | 1   | 4   | -                                         | PostRankDC | -                 | -               | Single
TE4 long         | 2   | 4   | -                                         | PostRankDC | -                 | -               | Single
TE5              | 1   | 4   | -                                         | -          | -                 | NTM             | Single
TE5 long         | 2   | 4   | -                                         | -          | -                 | NTM             | Single


Results and Discussion. The evaluation of the automatic term recognition 
experiments was based on the manual evaluation of a list of terms. A group 
of domain experts from Fidelity Investments who are familiar with financial 
industry taxonomies were asked to evaluate the relevance of each term to the 
financial domain according to a 5-point Likert-type scale (from irrelevant to very 
relevant). Any term rated as either 4 or 5 by a majority of annotators (i.e. at 
least four annotators) was considered a relevant term in the financial domain. 

The list of terms for evaluation was generated by merging the top 100 terms
extracted by each experiment, removing duplicates and ranking them using the
Mean Reciprocal Rank [16]. The final list was then limited to a manageable
number of terms (200 terms) and sent for manual validation by a team of experts
in the financial customer service domain. Since each term may appear in more
than one experiment, Table 2 reports how much of the evaluation list is covered
by each experiment. The result of each experiment is then evaluated using
precision, i.e. the proportion of correct terms among those appearing in the
evaluation set. Based on this evaluation, the experiments TE2 long and TE3 long
provided the best results (Table 2).
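
The construction of the evaluation list and the two reported measures can be sketched in Python as follows. This is an illustrative reading of the procedure, under the assumptions that a term missing from an experiment contributes a reciprocal rank of 0 and that ties are broken alphabetically; the function names and the toy terms are hypothetical.

```python
def merge_by_mrr(rankings, top_k=100, limit=200):
    """Merge ranked term lists, ordering terms by mean reciprocal rank.

    rankings: dict mapping an experiment name to its ranked list of terms.
    """
    totals = {}
    for ranked in rankings.values():
        for rank, term in enumerate(ranked[:top_k], start=1):
            totals[term] = totals.get(term, 0.0) + 1.0 / rank
    n = len(rankings)
    merged = sorted(totals, key=lambda t: (-totals[t] / n, t))
    return merged[:limit]


def coverage_and_precision(ranked, evaluation, relevant, top_k=100):
    """Coverage of the evaluation list and precision over the covered terms."""
    covered = [t for t in ranked[:top_k] if t in evaluation]
    coverage = len(covered) / len(evaluation)
    precision = sum(t in relevant for t in covered) / len(covered) if covered else 0.0
    return coverage, precision


toy_rankings = {
    "A": ["ira", "401k", "thanks"],
    "B": ["401k", "roth ira"],
}
evaluation_list = merge_by_mrr(toy_rankings, limit=3)
print(evaluation_list)
print(coverage_and_precision(toy_rankings["A"], evaluation_list, {"401k", "roth ira"}))
```

Terms ranked highly by several experiments rise to the top of the merged list, and an experiment is only judged on the terms it shares with that list.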

The positive results using the ComboBasic feature are mostly due to its
ability to remove conversational words (such as “good morning”, “thanks”).
Greetings appear in the corpus neither as part of longer combinations of words
(e.g. “good morning” will not typically be a substring of any longer noun
phrase) nor as an aggregation of smaller and frequent terms (e.g. “morning” is
not a frequent term in the corpus and “good” is not even considered a term,
being an adjective). The requirements for frequency and term embeddedness
expected by ComboBasic therefore make it less likely to consider greetings as
terms.


Table 2. Evaluation of automatic term recognition experiments

Experiment       | Coverage (%) | Precision (%)
TE baseline      | 36           | 48.6
TE baseline long | 38.5         | 59.7
TE1              | 47           | 50.0
TE1 long         | 41.5         | 59.0
TE2              | 30.5         | 54.1
TE2 long         | 37.5         | 65.3
TE3 long         | 37.5         | 65.3
TE4              | 45           | 56.7
TE4 long         | 30.5         | 62.3
TE5              | 21.5         | 30.2
TE5 long         | 21.5         | 30.2

The drawback of using ComboBasic, on the other hand, is that it fails to 
retrieve terms of different lengths. In fact, only the experiment TEs retrieved a 
mix of single and multi-word terms. Also, filtering out common out-of-domain
terms using the weirdness feature (TE1 long) did not remove other irrelevant
terms as expected. Our interpretation is that Wikipedia is not the most
suitable corpus for this use case. Instead, in future work, we would like to
experiment with customer service data in other domains so that we could remove
terms that are common to the customer service domain in general while keeping
those that are specific to customer service in the financial domain. The
difficulty lies in the availability of such datasets.


6.2 Taxonomy Construction 


The objectives of our taxonomy construction experiments are twofold: (i) to
evaluate the combination of likelihood score and search method that generates
the best taxonomy structure; and (ii) to verify the impact of using an
in-domain taxonomy as training data for the pairwise scoring function.

In order to separate the error generated by the automatic term recognition 
from the results of the taxonomy construction, we did not use the terms extracted 
previously. Instead, three manually constructed financial taxonomies provided by 
Fidelity Investments (financial products, financial sectors, and financial topics) 
were used to inform the terms to be used in each experiment. The products 
taxonomy was used to train the in-domain model for pairwise scoring while the 
remaining taxonomies were used as gold standard for evaluation. 

First, two pairwise scoring models were trained using LibSVM [12], a library
for support vector classification. The first model is trained on an
out-of-domain taxonomy (food domain) and background corpus provided by the
TexEval challenge [10]. The second model is trained using the products taxonomy
and the pages of the Fidelity.com website as a corpus for extraction of the
features required by the Pairwise Scoring algorithm. All relationships between
terms (c, d) appearing in the training taxonomy are given as positive examples
(+1) and any other combination of terms is a negative example (−1). As the
negative class is very large, we perform negative sampling with a fixed ratio
of 5 negative examples to each positive one. The pairwise scoring uses the
probability estimation in LibSVM, returning the probability that the class is
+1 as p(c ⊑ d).
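
The preparation of training examples with a fixed 5:1 negative sampling ratio could look like the following sketch. It covers only the sampling step, not feature extraction or the LibSVM training itself, and it assumes that negatives are ordered pairs of distinct terms that are not links in the training taxonomy; `build_training_pairs` and the seed parameter are hypothetical.

```python
import random
from itertools import permutations


def build_training_pairs(terms, positive_pairs, ratio=5, seed=0):
    """Return (pair, label) tuples with `ratio` negatives per positive.

    positive_pairs: (child, parent) links present in the training taxonomy.
    """
    positives = set(positive_pairs)
    # Every other ordered pair of distinct terms is a candidate negative.
    negatives = [pair for pair in permutations(terms, 2) if pair not in positives]
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    sampled = rng.sample(negatives, k)
    return [(pair, +1) for pair in sorted(positives)] + [(pair, -1) for pair in sampled]


toy_terms = ["a", "b", "c", "d", "e"]
toy_links = [("b", "a"), ("c", "a")]
training = build_training_pairs(toy_terms, toy_links)
print(len(training))
```

Fixing the seed keeps the sampled negatives reproducible between runs, which makes comparisons between models easier.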

The workflow for comparison of taxonomy construction configurations using 
the different models is the following: 


— Step 1. The term extraction algorithm was used to extract from the chatlog
dataset the frequency of each term in the gold standard taxonomy T_gold.

— Step 2. The taxonomy construction algorithm was applied using the term
frequencies from Step 1, varying the configuration (Table 3) and model to be
evaluated.


Table 3. Configuration of each taxonomy construction experiment 


Search | Likelihood score
       | Transitive | Non-transitive | Bhattacharyya-Poisson
Greedy | TA1        | TA2            | TA3
Beam   | TA4        | TA5            | TA6


The unit of analysis for our evaluation is each pair of concepts x and y where
there is a relation x ⊑ y. Note that transitivity was not taken into
consideration, therefore only direct connections between terms were
considered. The results of each run were evaluated using link precision as
described in Eq. 14, where T is the resulting taxonomy provided by the taxonomy
construction algorithm, and T_gold is the taxonomy provided as the expected
result.


$$precision(T, T_{gold}) = \frac{|\{(x \sqsubseteq y) \in T \wedge (x \sqsubseteq y) \in T_{gold}\}|}{|\{(x \sqsubseteq y) \in T\}|} \qquad (14)$$
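
Link precision, as described above, is the share of direct links in the constructed taxonomy that also appear in the gold standard. A minimal Python sketch, assuming each taxonomy is given as a collection of (child, parent) pairs and using the hypothetical name `link_precision`:

```python
def link_precision(predicted, gold):
    """Fraction of predicted direct links that also occur in the gold taxonomy."""
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


predicted_links = [("a", "b"), ("c", "b"), ("d", "c"), ("e", "a")]
gold_links = [("a", "b"), ("c", "b"), ("d", "b")]
print(link_precision(predicted_links, gold_links))
```

Because only direct links are compared, a term attached one level too deep (such as d under c instead of b) counts as an error even though it is still below the right ancestor.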


Results and Discussion. The problem of taxonomy construction is very
challenging, and previous evaluations such as TexEval [10] have only reported
precision as high as 30%. One challenge is that the structure of multiple
taxonomies in the same domain may vary considerably, so it is difficult to take
advantage of the overall structure of one taxonomy when the best structure in
another case may be completely different. Therefore, due to the multiple
challenges in the automatic generation of a taxonomy structure (see [10]), a
precision of 20% can already be considered a strong result.

From the perspective of the logical connection between terms in the taxonomy,
the best performing setting (Table 4) was the one using the
Bhattacharyya-Poisson likelihood score function, greedy search strategy, and
the in-domain model (using the products taxonomy as background knowledge). This
setting gave the best or joint-best results on all three taxonomies available.
It is important to note, however, that only the sectors and topics taxonomies
were used as gold standard since the products taxonomy was the one used as
training data. The results using the products taxonomy are displayed only to
contrast the impact of using an in-domain taxonomy versus an out-of-domain one
to train the pairwise scoring model. In fact, the results suggest that the
choice of likelihood score and search method contributes more to the quality of
the final taxonomy than the domain of the taxonomy used to train the pairwise
scoring algorithm. Therefore, we infer that the taxonomy construction framework
can be successfully applied to other customer service domains where there is no
background taxonomy to train the pairwise scoring model.


Table 4. Precision (%) of taxonomy construction algorithm 


Search method | Likelihood score function | Out-of-Domain model          | In-Domain model
              |                           | Products | Sectors | Topics | Products | Sectors | Topics
Greedy        | Transitive                | 16.86    | 11.25   | 11.41  | 7.28     | 8.75    | 7.38
Greedy        | Non-Transitive            | 17.62    | 12.50   | 11.41  | 21.84    | 15.00   | 13.42
Greedy        | BP                        | 17.62    | 12.50   | 12.08  | 22.22    | 16.25   | 13.42
Beam          | Transitive                | 4.21     | 1.25    | 1.34   | 3.83     | 1.25    | 0.67
Beam          | Non-Transitive            | 16.86    | 11.25   | 10.74  | 16.48    | 8.75    | 8.72
Beam          | BP                        | 18.01    | 12.50   | 10.74  | 15.33    | 8.75    | 9.40


In general, the pairwise scoring model is just one element that affects the
final taxonomy built from text. In some cases its use provided better results
and in some cases not. Overall, the choices of likelihood score function and
search strategy had a higher impact on the quality of the final taxonomy than
the taxonomy provided for pairwise scoring training. Nonetheless, the use of
a domain taxonomy as background knowledge showed between 10% and 25%
improvement in precision when using the non-transitive or BP score functions
with the greedy search.


7 Conclusion and Future Work 


In this paper, we presented a solution for the automatic extraction of
taxonomies, motivated by its use by conversational agents, and applied this
solution to an anonymised customer service dialogue dataset provided by
Fidelity Investments. We evaluated multiple methods for automatic term
recognition, where ComboBasic was the most suitable term scoring function for
the dataset used. Also, we introduced multiple functions to evaluate the
likelihood that a tree structure is a taxonomy and evaluated their efficacy in
taxonomy extraction. Furthermore, our results suggest that, although our
approach benefits from in-domain data, it does not require the taxonomy used
for training to be in the same domain as the business, which makes our solution
applicable to customer service domains where a manually created taxonomy is not
available.

In future work, we plan to explore: the use of customer service text datasets 
in other domains as background knowledge to remove greetings and other out- 
of-domain terms, the design of likelihood scores that take into consideration the 
desired graph structure of the taxonomy, and the application of the extracted 
taxonomy in a conversational agent. 


Abstract. Ontology alignment plays a key role in achieving interoperability
on the semantic Web. Inspired by the success of word embedding techniques in
several NLP tasks, we propose a new ontology alignment approach based on the
combination of word embedding and the radius measure. We tested our system on
the OAEI (http://oaei.ontologymatching.org/) conference track and then applied
it to aligning ontologies in a real-world case study. The experimental results
show that using word embedding and the radius measure makes it possible to
determine, with good accuracy, not only equivalence relations, but also
hierarchical relations between concepts.


Keywords: Ontology alignment - Word embedding 


1 Introduction 


The Silex¹ company develops a SaaS sourcing tool for the identification of the
service providers that are best suited to meet some service requests. The Silex
platform allows companies to provide a textual description of their
professional activities, their offers and the services they are looking for.
The work presented in this paper has been carried out in the context of a
collaboration between Silex and the I3S research laboratory, to add a semantic
layer to the Silex B2B platform, in order to be able to automatically process
the descriptions of service requests and improve the recommendation of relevant
providers. Ontology engineering work has been conducted to semantically
annotate the text descriptions of companies, offers, and service requests, with
three kinds of knowledge: skills, occupations, and business sectors. We
developed the Silex ontology by combining several meta-data repositories:
ESCO,² ROME,³ Cigref,⁴ NAF,⁵ UNSPSC,⁶ Kompass⁷ and an internal Silex business
sectors repository. Currently, the Silex ontology covers only the Computer
Science (CS) field [1]. Our aim now is to automatically align the entire
vocabularies to extend the Silex ontology to all business sectors.


1 https://www.Silex-france.com/Silex/.
2 https://ec.europa.eu/esco/portal/home.
3 http://www.pole-emploi.org/accueil/mot-cle.html?taglId=94b2eaf6-d7bd-4244-bddc-01415605563b.
4 http://cigref.hr-ingenium.com/accueil.aspx.

In this paper, we present a new approach to ontology alignment based on
word embedding and inspired by an existing proposal [6]. We use word
embedding to represent concepts and to compute not only equivalence relations
between concepts but also hierarchical relations. We report our experiments on
several open datasets from the Ontology Alignment Evaluation Initiative (OAEI)
benchmark and the Silex use case.

This paper is organized as follows: related work is discussed in Sect. 2. 
Section 3 describes our algorithm for ontology alignment. Section 4 reports and 
discusses the results of our experiments on the Silex use case. Section 5 draws 
some conclusions and discusses our perspectives as future work. 


2 Related Work 


The main issue when using several ontologies is to deal with their semantic 
heterogeneity when combining them: each ontology has its own designer, its own 
knowledge area and its own level of details. Ontology alignment is thus a crucial 
yet difficult task to achieve interoperability on the semantic Web. It aims to 
discover the correspondences between the entities of different ontologies, and 
express them as equivalence or hierarchical relations. 

There are two main ontology alignment techniques [2]: (i) element-level
techniques discover correspondences by calculating the surface similarity
between the lexical information of entities (usually labels); (ii)
structure-level techniques rely on the analysis of the neighbourhood of two
entities in order to determine their similarity. Both techniques are weak at
capturing the semantics of the lexical information of entities, and have been
extended by exploiting external information sources, such as WordNet or
Wikipedia. However, these auxiliary resources still suffer from the
incompleteness and non-exhaustiveness of their entries. To overcome this
problem, the approach presented in [6] uses word embedding to preserve the
semantic and syntactic similarities between words. This work mainly extracts
the lexical information (names, labels and comments of an entity) and searches
for equivalence relations between these pieces of information based on word
embedding similarity. In our work, we have been inspired by [6] to calculate
the similarities between entities based only on their labels. We extended this
approach by using the radius of clusters to find equivalence and hierarchical
relations between concepts.


5 https://www.insee.fr/fr/information/2406147.
6 https://www.unspsc.org/.
7 http://www.kompass-international.com/Corporate/home.html.


3 Overview of Our Approach to Ontology Alignment


Our alignment process is based on a set of rules exploiting the word embedding
similarity to discover the alignment. Our process is divided into four
successive steps described in the following subsections. Our system supports
two types of input (OWL ontologies and SKOS vocabularies), and two languages
(French and English). However, the two languages cannot be processed at the
same time, as we use a different word embedding model per language.


3.1 Extracting Lexical and Structural Information from Ontologies 


We started by extracting two types of information from the inputs: (i) lexical
information (e.g., labels of concepts) and (ii) structural information (e.g.,
to associate the labels of all child entities to their parent entities). To
achieve this, the two inputs (OWL or SKOS) are parsed with rdflib and queried
with a SPARQL query. Listing 1.1 shows an example query that handles SKOS input
in French. The same query is used for OWL ontologies by using rdfs:label
instead of skos:prefLabel to extract the label of the class or property, and
rdfs:subClassOf or rdfs:subPropertyOf instead of skos:broader to get the
hierarchical relation between classes or properties.


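Listing 1.1. SPARQL query to extract lexical and structural information from a
SKOS vocabulary
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# For each concept, collect the French labels of the concepts below it.
SELECT ?uri ?label
  (group_concat(DISTINCT ?mid_label; separator=":") AS ?lineage)
WHERE {
  ?uri skos:prefLabel ?label FILTER (lang(?label)='fr')
  # Inverse broader path: ?mid is ?uri itself or one of its descendants.
  ?uri ^skos:broader* ?mid. ?mid skos:prefLabel ?mid_label.
  FILTER (lang(?mid_label)='fr')
}
# Grouped per concept so that ?uri and ?label can be projected.
GROUP BY ?uri ?label ORDER BY count(?label)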


3.2 Computing Word Embedding Representations of Concepts 


The second step of our approach is to compute the vector representations of
concepts. We used pre-trained word vectors for French and English, learned
using fastText.⁸ The French model contains 1,152,449 tokens, and the English
model contains one million tokens. Both of them are mapped to 300-dimensional
vectors [3].

The vector representation of a concept is constructed by averaging, along each
dimension, the word embedding vectors of all the terms contained in its label
and occurring in the dictionary:

$$conceptWordEmbedding(c) = \frac{1}{n} \sum_{i=1}^{n} w_i$$

where n is the number of words in the dictionary occurring in the label of a
concept c and w_i ∈ R^300 denotes the word embedding vector of the i-th word.
If a term does not appear in the dictionary, it is simply ignored.


8 https://fasttext.cc/docs/en/pretrained-vectors.html.


Fig. 1. Precision and recall as a function of the similarity threshold.


In the case of structural information, the vector representation of a cluster
is given by averaging the word embedding vector representation of the label of
the root concept (which is itself an average) with the vector representations
of its child concepts:

$$clusterWordEmbedding(cl) = \frac{1}{k} \sum_{i=1}^{k} conceptWordEmbedding(c_i)$$

where k is the number of concepts in cluster cl.
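
The two averaging steps can be illustrated with a short Python sketch. It uses a toy two-dimensional dictionary instead of the 300-dimensional fastText vectors, and it assumes labels are lower-cased and split on whitespace, which the text does not specify; all function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_embedding(label, word_vectors):
    """Average the vectors of the label's words that are in the dictionary."""
    known = [word_vectors[w] for w in label.lower().split() if w in word_vectors]
    return mean_vector(known) if known else None


def cluster_embedding(root_label, child_labels, word_vectors):
    """Average the concept vectors of a root concept and its children."""
    concepts = [concept_embedding(label, word_vectors)
                for label in [root_label, *child_labels]]
    concepts = [v for v in concepts if v is not None]
    return mean_vector(concepts) if concepts else None


toy_vectors = {
    "software": [1.0, 0.0],
    "developer": [0.0, 1.0],
    "web": [1.0, 1.0],
}
print(concept_embedding("Software developer xyz", toy_vectors))
print(cluster_embedding("software", ["web developer"], toy_vectors))
```

The unknown word "xyz" is skipped, so the first call averages only the two known word vectors.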


3.3 Searching for Matching Concepts 


We match every concept in the source ontology O1 with the similar concepts in
the target ontology O2 using the cosine similarity between the vector
representations of concept and cluster. The correspondence is then added to the
alignment list based on the similarity threshold. Our algorithm aims at
collecting all the possible correspondences between concepts. We chose the
threshold empirically, by varying its value and calculating the recall and
precision measures for each value. Figure 1 shows that an optimal trade-off of
performance is achieved by setting the similarity threshold equal to 0.8.
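
The matching step can be sketched as a thresholded cosine comparison. The example below is a simplified illustration with toy two-dimensional vectors; it assumes that a pair is kept when its similarity is at least the threshold, and the names `cosine` and `match_concepts` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_concepts(source, target, threshold=0.8):
    """Return (source, target, similarity) triples at or above the threshold."""
    alignment = []
    for s_name, s_vec in source.items():
        for t_name, t_vec in target.items():
            similarity = cosine(s_vec, t_vec)
            if similarity >= threshold:
                alignment.append((s_name, t_name, similarity))
    # Most similar candidates first.
    return sorted(alignment, key=lambda match: -match[2])


source_vectors = {"developer": [1.0, 0.1]}
target_vectors = {"programmer": [1.0, 0.2], "designer": [0.0, 1.0]}
for match in match_concepts(source_vectors, target_vectors):
    print(match)
```

All candidates above the threshold are kept rather than only the best one, in line with the aim of collecting every possible correspondence.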


3.4 Refining the Nature of the Relationship Between Two Matching 
Concepts 


The result of the previous step is a list of matching concepts whose
relationship must be made more precise. To link two concepts that are
sufficiently similar, we used skos:closeMatch for SKOS and owl:sameAs for OWL.
To define a hierarchical mapping link between two concepts, we used
skos:broader or skos:narrower for SKOS and rdfs:subClassOf or
rdfs:subPropertyOf for OWL.

This relationship between two matching concepts is refined by comparing the
radii of their respective embedding vector clusters, formed mainly using
structural information. The radius of a cluster is commonly taken to be the
maximum distance between the vectors representing the terms and the centroid.
Here, we define the radius of a cluster of concepts as the standard deviation
of their cosine dissimilarity with respect to the centroid of the cluster:

$$radius = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(1 - \frac{w_i \cdot \bar{w}}{\|w_i\| \, \|\bar{w}\|}\right)^2}$$

where w_i ∈ R^300 is the vector representation of the i-th concept in the
cluster, N is the size of the cluster, and w̄ ∈ R^300 is the centroid of the
cluster, defined as w̄ = (1/N) Σ_{i=1}^{N} w_i. We assume that the cluster with
the lowest average distance between a point and the centroid is in a broader
relation with the cluster that has the biggest radius. We decide on the
relationship holding between two similar concepts by comparing their radii
based on the following rules:


$$|radius(C_1) - radius(C_2)| < 0.1 \Rightarrow C_1 \; closeMatch \; C_2 \qquad (1)$$

$$|radius(C_1) - radius(C_2)| > 0.1 \Rightarrow C_1 \; narrowMatch \; C_2 \wedge C_2 \; broadMatch \; C_1 \qquad (2)$$
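
The radius comparison can be sketched in Python as below. This is an illustration under stated assumptions: the radius is computed as the root-mean-square cosine dissimilarity between each concept vector and the cluster centroid, and the sketch only separates equivalence from hierarchy, leaving the direction of a hierarchical link to the rules above. The names `radius` and `relation_kind` are hypothetical.

```python
import math


def radius(vectors):
    """Root-mean-square cosine dissimilarity of the vectors to their centroid."""
    n = len(vectors)
    centroid = [sum(column) / n for column in zip(*vectors)]

    def dissimilarity(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return 1.0 - dot / norm if norm else 1.0

    return math.sqrt(sum(dissimilarity(w, centroid) ** 2 for w in vectors) / n)


def relation_kind(radius_1, radius_2, tolerance=0.1):
    """closeMatch when the radii differ by less than the tolerance."""
    if abs(radius_1 - radius_2) < tolerance:
        return "closeMatch"
    return "hierarchical"


tight = radius([[1.0, 0.0], [1.0, 0.0]])   # identical vectors
spread = radius([[1.0, 0.0], [0.0, 1.0]])  # orthogonal vectors
print(tight, spread, relation_kind(tight, spread))
```

A cluster of identical vectors has radius 0, while two orthogonal vectors give a radius of about 0.29, so the pair is classified as hierarchical under the 0.1 tolerance.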


4 Experiments 


To evaluate the effectiveness of our approach, we performed experiments on two 
alignment datasets: (i) Task-oriented complex alignment on conference organisa- 
tion and (ii) the Silex use case. The performances of our approach are measured 
by calculating precision, recall and F-measure [4]. 


4.1 Experiments on Task-Oriented Complex Alignment 
on Conference Organisation 


To validate the proposed approach, we evaluated it on a conference complex
alignment benchmark⁹,¹⁰ for ontology merging, which has been constructed
within the framework of the OAEI. This data set contains 57 correspondences
made on five OWL ontologies. Following the evaluation process presented in [5],
we have taken into account only the alignments that exist in the complex data
set and we ignored the alignments of the simple data set. If our system finds
the correct match within a proposed list, we consider the entire proposed list
to be correct. This decision is justified by the fact that our system was
designed to support end-users by presenting a list of possible matches. We
compared our matching results with the results of three state-of-the-art
systems mentioned in [5]. Our system clearly outperforms the others on this
benchmark, with a precision of 0.89 and a recall of 0.69, compared to 0.83 and
0.13 for the best state-of-the-art system. Several reasons can explain the
correspondences that our system misses or mislabels: (i) the cosine similarity
between some classes is much smaller than the threshold, so the match is
discarded (e.g. cosine similarity(‘chair main’, ‘demo chair’) = 0); (ii) our
system is not designed to test hierarchical relations between two leaf nodes,
since this type of relationship must pass through the structural information to
calculate the radius and, thus, infer the relationship; (iii) based on Eq. 1,
our system can assign an equivalence relation instead of a hierarchical
relation when the difference of radius between two classes is smaller than 0.1.

9 Thieblin, Elodie (2019): Task-oriented complex alignments on conference
organisation. figshare. Dataset.
10 https://doi.org/10.6084/m9.figshare.4986368.v8.


4.2 Experiments on the Silex Use Case


The second data set used in this evaluation consists of the vocabularies
gathered for the Silex use case in the CS field: we tried to match (i) ESCO
(160 concepts to represent occupations) to Cigref (42 concepts), (ii) ESCO to
ROME (117 concepts), (iii) NAF to Kompass (574 concepts) and (iv) NAF to Silex
activity domains (14 concepts). A gold standard for each matching case was
provided by an expert in the Silex company. Depending on the vocabularies to be
aligned, the precision value ranges between (i) 0.71 and 0.8 for the closeMatch
relation, (ii) 0.7 and 0.83 for the narrowMatch relation and (iii) 0.73 and 1
for the broadMatch relation. On the other hand, the recall value ranges between
(i) 0.6 and 0.95 for the closeMatch relation, (ii) 0.69 and 1 for the
narrowMatch relation and (iii) 0.68 and 1 for the broadMatch relation. For
example, the ROME concept “computer developer” is stated to be broader than the
ESCO concept “Applications programmers”, which is itself in a broad relation
with the ESCO concepts “Usability designer”, “System programmer” and “System
developer”.


5 Conclusion 


In this paper, we reported the results of a novel ontology alignment method, 
capable of distinguishing between equivalence and hierarchical relationships. Our 
first challenge was to address the real-world use case encountered by the Silex
company. These results show that the proposed approach to ontology alignment 
based on a vector representation of the concepts to be matched is promising. As 
future work, we aim at defining a specific set of pre-trained word vectors that 
best covers the Silex B2B use case. We also aim at performing an empirical study 
to define the optimal threshold for radius difference. 


Abstract. The difficulty of representing and organizing knowledge in
reasonably complete ways raises at least two research questions: “how to check
that particular relations are systematically used not just whenever possible
but whenever relevant for knowledge providers?” and “how to extend best
practices, ontology patterns or methodologies advocating the systematic use of
particular relations and, at the same time, automate the checking of compliance
with these methods?”. As an answer, this article proposes a generic “ontology
design rule” (ODR). A general formulation of this generic ODR is: in a given
KB, for each pair of knowledge base objects (types or individuals) of a given
set chosen by the user of this ODR, there should be either statements
connecting these objects by relations of particular given types or statements
negating such relations. This article further specifies this ODR and shows its
interest for subtype relations and other transitive relations, e.g. part
relations and specialization relations with genus & differentia. This article
shows how this ODR can be implemented via OWL and SPARQL, at least for common
simple cases (and, generically, via a higher-order logic based language).


Keywords: Ontology design patterns - Ontology completeness - OWL - 
SPARQL 


1 Introduction 


Representing and organizing knowledge within or across knowledge bases (KBs) is a 
fundamental and difficult task for knowledge sharing and inferencing, and thereby for 
knowledge retrieval and exploitation. At least three kinds of research avenues (relevant 
to refer to in this article) guide this task. The first are ontologies made for reuse 
purposes (with methodologies implicitly or explicitly based on these ontologies, e.g. 
the Ontoclean methodology): foundational ontologies such as DOLCE and BFO; task- 
oriented ones such as OWL-S; general ones such as DBpedia and Schema.org; domain- 
oriented ones such as those from BioPortal. The second are catalogs of best
practices [1, 2] and ontology patterns [3, 4] or anti-patterns [5, 6]. The
third are ontology/KB evaluation criteria and measures [7], e.g. for knowledge
connectedness, precision, consistency, conciseness and completeness. The
results of these three kinds of research avenues are especially helpful for
building reusable ontologies.

© The Author(s) 2019

These three kinds of research avenues advocate the use of relations of particular 
types between objects of particular types. In the RDF terminology, one would say that 
these three kinds of research avenues advocate the use of properties to connect 
resources of particular classes — e.g. the use of subClassOf or equivalentClass relations 
between classes or other objects, whenever this is relevant. However, often, only a 
knowledge provider knows when it is relevant to use a particular property. This limits 
the possibilities of checking or guiding the use of the advocated properties. Further- 
more, it may also be useful that the knowledge provider represents when the advocated 
properties do not or cannot occur. E.g., representing disjointWith or complementOf 
relations between classes to express that subClassOf or equivalentClass relations 
cannot occur between these classes has many advantages that Sect. 2 illustrates. Using 
all these relations is especially useful between top-level classes since many inference 
engines can exploit the combination of these relations, e.g. via inheritance mechanisms. 
Finally, checking that particular relations are represented as either existing or forbidden 
can be done automatically. Thus, as an answer to the research questions “how to check 
that particular relations are systematically used not simply whenever this is possible but 
whenever this is relevant for the knowledge providers?” and “how to extend best 
practices, ontology patterns or methodologies that advocate the systematic use of 
particular relations, and make the compliance with these methods easier to check?”, 
this article proposes the following generic “ontology design rule” (ODR). A first 
general formulation of this generic ODR is: in a given KB, for each pair of objects of a 
given set chosen by the user of this ODR, there should be either statements connecting 
these objects by relations of particular given types or statements negating such rela- 
tions, i.e. expressing that these relations do not or cannot occur in the given KB. 
A negated relation can be expressed directly via a negated statement or indirectly, e.g. 
via a disjointWith relation that forbids the existence of such a relation. 

In its more precise version given below, we call this ODR the “comparability via
particular relation types” ODR, or simply the “comparability ODR”. We call it an ODR
rather than a pattern or a KB evaluation criterion/measure because it lies in between
the two. As explained above, it is always automatically checkable. It is also reusable
for evaluating a KB, for example by applying it to all its objects and dividing the
number of successful cases by the number of objects. An example of a KB evaluation
criterion that can be generalized by a reuse of this ODR is the “schema-based coverage”
criterion of [1], which measures the percentage of objects using the relations that they
should or could use according to schemas or relation signatures. Examples of
methodologies, best practices or ontology patterns that can be generalized via the use of
this ODR are those advocating the use of tree structures or of genus & differentia when
organizing or defining types. (Sections 2.3 and 3.3 detail this last point.)

Before formulating this ODR more precisely, it is worth detailing its application to
the OWL properties subClassOf or equivalentClass, along with those that negate or
exclude them, e.g. disjointWith and complementOf. Using all these properties
whenever relevant, as this applied ODR encourages, will for example lead the
authors of a KB to organize the direct subtypes of each class — or at least each top-level
class — into “complete sets of exclusive subtypes” (each of such sets being a subclass 
partition, or in other words, a disjoint union of subtypes equivalent to the subtyped 
class), and/or “incomplete sets of exclusive subtypes”, and/or “(in-)complete sets of 
subtypes that are not exclusive but still different and not relatable by subClassOf rela- 
tions”, etc. The more systematic the organization, the more a test of whether a class is 
subClassOf_or_equivalent (i.e. is subClassOf, equivalentClass or sameAs) another class 
will lead to a true/false result, not an “unknown” result. In other words, the more such a 
test will lead to a true/false result without the use of “negation as failure” (e.g. via the 
“closed-world assumption”: any statement not represented in the KB is considered to be 
false) or the use of the “unique name assumption” (with which different identifiers are 
supposed to refer to different things). Since most inferences are based on such 
subClassOf_or_equivalent tests, the more systematic the organization, the more infer- 
ences will be possible without having to use negation as failure. This is interesting since 
using negation as failure implies making an assumption about the content of a KB 
whereas adding subClassOf or disjointWith relations means adding information to a KB. 

The next two sections illustrate some of the many advantages of the more sys- 
tematic organization resulting from the application of this ODR: for inferencing or 
querying, for avoiding what could have otherwise been implicit redundancies or 
inconsistencies and, more generally, for improving the completeness, consistency and 
precision of a KB. These advantages are not restricted to subClassOf_or_equivalent 
relations. They apply to all specializationOf_or_equivalent relations, i.e. specializa- 
tionOf relations (they generalize subClassOf relations), equivalence relations or 
sameAs relations. As we shall see, most of these advantages also apply to other 
transitive relations such as partOf_or_equivalent (i.e. isSubPartOf, equivalentClass or 
sameAs). We call “specialization of an object” any other object that represents or
refers to more information on the same referred object. This covers all subtype relations 
but also specialization relations between individuals, e.g. between the statements “some 
cars are red” and “John’s car is dark red”. We call “statement” a relation or a set of 
connected relations. 

We adopt the following “comparability” related definitions. Two objects are 
“comparable via a relation of a particular type” (or, more concisely, “comparable via a 
particular relation type”) if they are either identical (sameAs), equivalent (by intension, 
not extension) or connected by a relation of this type. Two objects are “uncomparable
via a relation of a particular type” (or, more concisely, “uncomparable via a
particular relation type”) if they are different and if some statement in the KB forbids a
relation of this type between these two objects. Given these definitions, the
comparability ODR can be defined as testing whether “each object (in the KB or a part of the
KB selected by the user of this ODR) is defined as either comparable or uncomparable 
to each other object (or at least some object, if the user prefers) via each of the tested 
relation types”. In a nutshell, the comparability ODR checks that between particular 
selected objects there is “either a comparability or an uncomparability via particular 
relations”. This ODR does not rely on particular kinds of KBs or inference engines but 
powerful engines may be relevant for checks if they infer more relations. 
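
The definitions above can be illustrated by a minimal Python sketch. It is not the implementation of this article: the KB is modelled as a plain set of (subject, property, object) triples, the property names are written without prefixes, the function names are hypothetical, and only explicitly stated relations are examined (as assumed in this article, relations that an inference engine can derive are supposed to have been made explicit beforehand).

```python
# Illustrative sketch only: a KB is a set of (subject, property, object) triples.
# Relations inferable by an engine are assumed to be already explicit in the set.

COMPARABLE = {"subClassOf", "equivalentClass", "sameAs"}        # tested type + equivalence
FORBIDDING = {"disjointWith", "complementOf", "differentFrom"}  # statements negating them

def related(kb, a, b, props):
    """True if a triple with a property in `props` links a and b, in either direction."""
    return any((a, p, b) in kb or (b, p, a) in kb for p in props)

def unknown_pairs(kb, objects):
    """Pairs that are neither comparable nor uncomparable, i.e. ODR violations."""
    objs = sorted(objects)
    return [(a, b) for i, a in enumerate(objs) for b in objs[i + 1:]
            if not related(kb, a, b, COMPARABLE)
            and not related(kb, a, b, FORBIDDING)]

def odr_score(kb, objects):
    """Number of objects following the ODR divided by the number of objects."""
    bad = {o for pair in unknown_pairs(kb, objects) for o in pair}
    return (len(objects) - len(bad)) / len(objects)

kb = {
    ("Town", "subClassOf", "Location"),
    ("Battle", "subClassOf", "Action"),
    ("Action", "disjointWith", "Location"),
}
classes = {"Town", "Location", "Battle", "Action"}

for pair in unknown_pairs(kb, classes):
    print("neither comparable nor uncomparable:", pair)
print("score:", odr_score(kb, classes))
```

With only the three stated relations, the pairs (Action, Town), (Battle, Location) and (Battle, Town) are reported. Once the disjointness inherited by subclasses is also made explicit, no pair remains and the score becomes 1.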

Stronger versions of this ODR can be used. E.g., for a more organized KB, some 
users may wish to have “either comparability or strong uncomparability” via relations 
of particular types between any two objects. Two objects are “strongly uncomparable
via a relation of a particular type” if they are different and if some statement in the KB
forbids the existence of a relation of this type between the two objects as well as
between their specializations. E.g., disjoint classes are strongly uncomparable since
they cannot have shared instances or shared subclasses (except for owl:Nothing).

A more general version of this ODR could also be defined by using “equivalence 
by intension or extension” instead of simply “equivalence by intension”. In this article, 
“equivalence” means “equivalence by intension” and “specialization” is also “spe- 
cialization by intension”. This article also assumes that the equivalence or specializa- 
tion relations (or their negations) which are automatically detectable by the used 
inference engine are made explicit by KB authors and thus can be exploited via 
SPARQL queries. In a description logics based KB, this can be achieved by performing 
type classification and individual categorization before checking the ODR. 

Figure 1 shows a simple graphical user interface for selecting various options or
variants for this ODR. With the shown selected items (cf. the blue items in Fig. 1 and 
the words in italics in the rest of this sentence), this interface generates a function call 
or query to check that each object (in the default KB) which is instance of owl:Thing is 
either comparable or uncomparable via specialization relations and part relations to 
each other object in the default KB. Figure 1 shows a function call. After the con- 
version of its last three parameters into more formal types, a similar call can be made to 
a generic function. [8] is an extended version and on-line companion article for this 
one. In its appendix, [8] defines this generic function and the types it exploits. To 
achieve this, these definitions are written in a higher-order logic based language. 


[Figure 1: interface titled “Evaluation of Comparability Between Objects”. Its menus
select the class of the from/to objects, the relation types to test (e.g. subtype,
subClass, subProperty, transitive or non-transitive relations), the kind of comparability
(comparability_or_uncomparability, comparability_or_strong-uncomparability,
comparability-or-uncomparability_with-minimal-differentia or
connectability_or_un-connectability), whether the check goes from each object to each or
some other object, the scope (objects in the KB, or in the KB and those reachable from
the KB) and the KB itself. The final command can be produced as a SPARQL query, as SHACL
or as a function call; the function call shown is:
comparability evaluation (Thing, {specialization, part}, comparability_or_uncomparability,
"each", "in the KB", "default KB")]


Fig. 1. A simple interface for object comparability/connectability evaluation (Color figure 
online) 
With the comparability_or_uncomparability option (hence with the comparability
ODR), equivalence or sameAs relations are always exploited in addition to the
specified relations. When it is not relevant to also exploit equivalence or sameAs
relations, the connectability_or_un-connectability option shown in Fig. 1 should be selected.

The next two sections show the value of this ODR for, respectively, (i) subtype
relations, and (ii) other relations, e.g. part relations and specialization relations with 
genus & differentia. When relevant, these sections present type definitions in OWL, as 
well as SPARQL queries or update operations, to illustrate how this ODR can be 
implemented. Figure 1 shows that SHACL (a constraint language proposed by the 
W3C) may also be exploited when its expressiveness is sufficient to express the con- 
straint that needs to be represented. However, this exploitation is not described in this 
article. Section 4 provides more comparisons with other works and concludes. 


2 Comparability of Types via Subtype Relations 


2.1 Representation via OWL 


In this document, OWL refers to OWL-2 (OWL-2 DL or OWL-2 Full) [9] and OWL 
entities are prefixed by “owl:”. All the types that we propose in this article are in 
http://www.webkb.org/kb/it/o_knowledge/d_odr_content/sub/ and the “sub” names- 
pace is here used to abbreviate this URL. Unless otherwise specified, the syntax used 
for defining these types is Turtle, and the syntax used for defining queries or update 
operations is SPARQL. SPARQL uses Turtle for representing relations. For clarity 
purposes, identifiers for relation types have a lowercase initial while other identifiers 
have an uppercase initial. 

To illustrate the interest of representing exclusion relations between classes (and,
more generally, of making types “uncomparable via subClassOf relations” whenever
possible), here is an example in two parts. The first part is composed
of the following RDF+OWL/Turtle statements. They do not represent any exclusion
relation. They represent a few relations from WordNet 1.3 (not the current one,
WordNet 3.1). According to these relations, Waterloo is both a battle and a town, any
battle is a (military) action, any town is a district, and any district is a location.


wn:Waterloo rdf:type wn:Battle, wn:Town.
wn:Battle rdfs:subClassOf wn:Military_action.
wn:Military_action rdfs:subClassOf wn:Action.

wn:Town rdfs:subClassOf wn:District.
wn:District rdfs:subClassOf wn:Location.


Now, as a second part of the example, a disjointWith relation is added between two 
top-level classes: the one for actions and the one for locations. This exclusion relation 
between actions and locations has not been made explicit in WordNet but is at least 
compatible with the informal definitions associated to categories in WordNet. Given all 
these relations, an OWL inference engine (that handles disjointWith relations) detects 
that the categorization of Waterloo as both a battle and a town is inconsistent. As 
illustrated in Sect. 2.3, many other possible problems in WordNet 1.3 were similarly 
detected. Most of them do not exist anymore in the current WordNet. 
wn:Action owl:disjointWith wn:Location.
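
The detection performed by an OWL inference engine on this example can be approximated by the following Python sketch. It is only an illustration under simplifying assumptions: the WordNet relations are hard-coded in dictionaries, the "wn:" prefix is dropped, and the helper names are hypothetical. It collects all the direct and indirect superclasses of the types of an individual and reports any pair of them that is declared disjoint.

```python
# Illustrative sketch only (not the tool of the article): detect an individual
# whose types have superclasses that are declared disjoint.

SUBCLASS_OF = {                      # class -> its direct superclasses
    "Battle": {"Military_action"},
    "Military_action": {"Action"},
    "Town": {"District"},
    "District": {"Location"},
}
DISJOINT = {frozenset({"Action", "Location"})}
TYPES = {"Waterloo": {"Battle", "Town"}}

def ancestors(cls):
    """The class itself plus all its direct and indirect superclasses."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(SUBCLASS_OF.get(c, ()))
    return seen

def violations(individual):
    """Pairs of disjoint classes that both (directly or indirectly) type the individual."""
    classes = set().union(*(ancestors(t) for t in TYPES[individual]))
    return [tuple(sorted(pair)) for pair in DISJOINT if pair <= classes]

print(violations("Waterloo"))
```

Without the disjointWith statement the list of violations is empty, which is why the inconsistent categorization of Waterloo went undetected in the first part of the example.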


OWL DL is sufficient for representing statements implying that particular classes
are “comparable via subClassOf (relations)” or “strongly uncomparable via
subClassOf”. For this second case, which amounts to stating that two classes are disjoint,
the OWL terms owl:AllDisjointClasses, owl:complementOf, owl:disjointWith and
owl:disjointUnionOf can be used. OWL Full [9] is necessary
for setting owl:differentFrom relations between classes, and hence, as shown
below, for defining the property sub:different_and_not_subClassOf
as a sub-property of owl:differentFrom. In turn, this property is necessary for
representing statements implying that particular classes are weakly uncomparable, i.e.
uncomparable but not strongly uncomparable (hence neither disjointWith nor
complementOf). OWL Full is also necessary for defining the properties
sub:different_and_not_equivalentClass and sub:proper-subClassOf (alias,
sub:subClassOf_and_not-equivalentClass). With all the above cited
types, it is possible for KB authors to express any relationship of “comparability or
uncomparability via subClassOf”.

OWL inference engines generally cannot exploit OWL Full and hence do not 
enforce nor exploit the semantics of definitions requiring OWL Full. When inference 
engines do not accept OWL Full definitions, the above cited “sub:” properties have to 
be solely declared (as being properties) instead of being defined via relations (hence by 
a logic formula). However, when inference engines do not accept or do not exploit 
OWL Full definitions, the loss of inferencing possibilities due to the non-exploitation 
of the above cited “sub:” properties is often small. When the goal is simply to detect 
whether the comparability ODR is followed, if the SPARQL query proposed in the 
next subsection is used to achieve that goal, it does not matter whether the above cited 
“sub:” properties are declared or defined. 

Making every pair of classes in a KB comparable or uncomparable via subClassOf
is cumbersome without the use of properties that create (in-)complete sets of
(exclusive) subclasses. We propose such properties, e.g.
sub:complete_set_of_uncomparable-subClasses,
sub:incomplete_set_of_uncomparable-subClasses and
sub:proper-superClassOf_uncomparable_with_its_siblings.
Such complex properties cannot be defined in OWL.
However, as illustrated below, SPARQL update operations can be written to replace the
use of these complex properties by the use of simpler properties that OWL inference
engines can exploit.
sub:proper-subClassOf rdfs:subPropertyOf rdfs:subClassOf ;
    owl:propertyDisjointWith owl:equivalentClass .
  #a "proper subClass" is a "strict subClass" (a direct or indirect one)


sub:proper-subPropertyOf rdfs:subPropertyOf rdfs:subPropertyOf ;
    owl:propertyDisjointWith owl:equivalentProperty .


sub:different_and_not_subClassOf rdfs:subPropertyOf owl:differentFrom ;
    owl:propertyDisjointWith rdfs:subClassOf .


sub:different_and_not_equivalentClass rdfs:subPropertyOf owl:differentFrom ;
    owl:propertyDisjointWith owl:equivalentClass .


sub:proper-superClassOf owl:inverseOf sub:proper-subClassOf .
sub:proper-superClassOf_uncomparable_with_its_siblings
    rdfs:subPropertyOf sub:proper-superClassOf . #partial definition only


#Example of a SPARQL update operation to replace the use of
# sub:proper-superClassOf_uncomparable_with_its_siblings relations
# by simpler relations:
DELETE
 { ?c sub:proper-superClassOf_uncomparable_with_its_siblings ?sc1, ?sc2 }
INSERT { ?c sub:proper-superClassOf ?sc1, ?sc2 .
         ?sc1 sub:different_and_not_subClassOf ?sc2 .
         ?sc2 sub:different_and_not_subClassOf ?sc1 }
WHERE  { ?c sub:proper-superClassOf_uncomparable_with_its_siblings ?sc1, ?sc2
         FILTER (?sc1 != ?sc2) }
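
For readers less familiar with SPARQL Update, the effect of the above operation can be mimicked on an in-memory set of triples, as in the Python sketch below. This is an illustration under our own assumptions rather than part of the proposed implementation: triples are plain tuples of strings and the function name expand is hypothetical.

```python
from itertools import permutations

SIBLINGS = "sub:proper-superClassOf_uncomparable_with_its_siblings"

def expand(kb):
    """Return a copy of the triple set `kb` where SIBLINGS relations are replaced
    by sub:proper-superClassOf and sub:different_and_not_subClassOf relations."""
    subs = {}
    for s, p, o in kb:
        if p == SIBLINGS:
            subs.setdefault(s, set()).add(o)
    out = set(kb)
    for c, children in subs.items():
        if len(children) < 2:
            continue  # like the WHERE clause, requires two distinct subclasses
        for child in children:
            out.discard((c, SIBLINGS, child))
            out.add((c, "sub:proper-superClassOf", child))
        for a, b in permutations(children, 2):
            out.add((a, "sub:different_and_not_subClassOf", b))
    return out

kb = {("A", SIBLINGS, "B"), ("A", SIBLINGS, "C"), ("X", SIBLINGS, "Y")}
for triple in sorted(expand(kb)):
    print(triple)
```

As with the SPARQL operation, a class that has a single subclass stated via the complex property (X in this example) is left unchanged, since the filter requires two distinct subclasses.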


Similarly, to state that particular properties are (strongly or at least weakly)
“uncomparable via rdfs:subPropertyOf relations”, OWL DL is sufficient. For
strong uncomparability, owl:propertyDisjointWith relations can be used.
Defining that particular properties are only weakly uncomparable, i.e. uncomparable
but not strongly uncomparable, is possible in OWL Full, exactly as for subClassOf
relations: to define these properties, it is sufficient to replace every occurrence of “class”
by “property” in the above code. As for classes, if these “sub:” properties are only
declared instead of being defined, the loss of inferencing possibilities is small.


2.2 Checking via SPARQL 


Using SPARQL (1.1) [10] to check the “comparability of classes via subClassOf
relations” means finding each class that does not follow this ODR, i.e. each class
that is neither comparable nor uncomparable via subClassOf relations to each/some
other class in selected KBs (“each/some” depending on what the user wishes to test).

Below is a SPARQL query for the “each other class” choice, followed
by a SPARQL query for the “some other class” choice. In either case, if instead of the
“comparability_or_uncomparability” option (the default option selected in Fig. 1), the
user prefers the “comparability_or_strong-uncomparability” option, the two lines about
sub:different_and_not_subClassOf relations should be removed. For the
“connectability_or_un-connectability” option, the line about owl:equivalentClass
and owl:sameAs relations should instead be removed.
SELECT distinct ?c1 ?c2 WHERE #query for the "each other class" choice
{ ?c1 a owl:Class. ?c2 a owl:Class. FILTER(?c1 != ?c2)

  #skip comparable objects (here, classes comparable to ?c1):
  FILTER NOT EXISTS{ ?c1 rdfs:subClassOf|^rdfs:subClassOf ?c2 }
  FILTER NOT EXISTS{ ?c1 owl:equivalentClass|owl:sameAs ?c2 }

  #skip strongly uncomparable objects:
  FILTER NOT EXISTS{ ?c1 owl:complementOf|owl:disjointWith ?c2 }
  FILTER NOT EXISTS{ [] rdf:type owl:AllDisjointClasses;
                        owl:members/rdf:rest*/rdf:first ?c1,?c2 }
  FILTER NOT EXISTS{ [] owl:disjointUnionOf/rdf:rest*/rdf:first ?c1,?c2 }

  #skip remaining uncomparable objects that are only weakly uncomparable:
  FILTER NOT EXISTS { ?c1 owl:differentFrom ?c2 }
} #no need to use sub:different_and_not_subClassOf here since, at this
  # point, subClassOf relations have already been filtered out


SELECT distinct ?c1 WHERE #query for the "some other class" choice
{ ?c1 a owl:Class. #for each class ?c1

  #skip comparable objects (here, classes comparable to ?c1):
  FILTER NOT EXISTS{ ?c1 rdfs:subClassOf|owl:equivalentClass|owl:sameAs ?c2
                     FILTER ((?c1!=?c2) && (?c2!=owl:Nothing)) }

  #skip strongly uncomparable objects:
  FILTER NOT EXISTS{ ?c1 owl:complementOf|owl:disjointWith ?c2
                     FILTER ((?c1!=?c2) && (?c2!=owl:Nothing)) }
  FILTER NOT EXISTS{ [] rdf:type owl:AllDisjointClasses;
                        owl:members/rdf:rest*/rdf:first ?c1,?c2 }
  FILTER NOT EXISTS{ [] owl:disjointUnionOf/rdf:rest*/rdf:first ?c1,?c2 }

  #skip remaining uncomparable objects that are only weakly uncomparable:
  FILTER NOT EXISTS { ?c1 owl:differentFrom ?c2 }
}


Checking the “comparability of properties via subPropertyOf relations” is similar to
checking the “comparability of classes via subClassOf relations”. The above SPARQL
query can easily be adapted. The first adaptation to make is to replace every occurrence
of “class” by “property”, to replace “disjointWith” by “propertyDisjointWith” and to
replace “complementOf” by “inverseOf”. The second adaptation to make is to remove
the lines about “AllDisjointClasses” and “disjointUnionOf” since in OWL these types
do not apply to properties and have no counterpart for properties.


Dealing with Several Datasets. A KB may reuse objects defined in other KBs; object 
identifiers may be URIs which refer to KBs where more definitions on these objects can 
be found. We abbreviate this by saying that these other KBs or definitions are reachable 
from the original KB. Similarly, from this other KB, yet other KBs can be reached. One 
feature proposed in Fig. 1 is to check all objects “in the KB and those reachable from 
the KB”. Since comparability checking supports the detection of particular inconsis- 
tencies and redundancies (cf. next subsection and next section), the above cited feature 
leads to the checking that a KB does not have particular inconsistencies or redun- 
dancies with the KBs reachable from it. This feature does not imply fully checking 
these other KBs. The above presented SPARQL query does not support this feature 
since it checks classes in the dataset of a single SPARQL endpoint. Implementing this
feature via SPARQL while still benefiting from OWL inferences unfortunately requires 
the SPARQL engine and the exploited OWL inference engine to work on a merge of all 
datasets reachable from the originally queried dataset. For small datasets, one way to 
achieve this could be to perform such a merge beforehand via SPARQL insert oper- 
ations. However, when it is not problematic to give up OWL inferences based on 
knowledge from other datasets, an alternative is to use a SPARQL query where 
(i) “SPARQL services” are used for accessing objects in other datasets, and
(ii) transitive properties such as rdfs:subClassOf are replaced by property path
expressions such as “rdfs:subClassOf+”.


2.3 Advantages for Reducing Implicit Redundancies, Detecting 
Inconsistencies and Increasing Knowledge Querying Possibilities 


Within or across KBs, hierarchies of types (classes or properties) may be at least 
partially redundant, i.e. they could be at least partially derived from one another if 
particular type definitions or transformation rules were given. Implicitly redundant type 
hierarchies, i.e. non-automatically detectable redundancies between type hierarchies, 
are reduced and easier to merge (manually or automatically) when types are related by 
subtypeOf_or_equivalent relations, e.g. subClassOf, subPropertyOf, equivalentClass or 
equivalentProperty relations. Using such relations is also a cheap and efficient way of 
specifying the semantics of types. 

Relating types by not_subtypeOf-or-equivalent relations — e.g. disjointWith or 
complementOf relations — permits the detection or prevention of incorrect uses of such 
relations and of instanceOf relations. These incorrect uses are generally due to someone
not knowing some particular semantics of a type, either because this person forgot it
or because it was never made explicit. The two-point list below
gives some examples extracted from [11]. In that article, the author (who is also the
first author of the present article) reports on the way he converted the noun-related part
of WordNet 1.3 into an ontology. Unlike for other such conversions, the goal was to
avoid modifying the meanings of the conceptual categories of WordNet as specified by
their associated informal definitions and informal terms. The author reports that, after
adding disjointWith relations between top-level conceptual categories which according 
to their informal definitions seemed exclusive, his tool automatically detected 230 
violations of these exclusions by lower-level categories. In the case of WordNet, what 
these violations mean is debatable since it is not an ontology. However, like all such 
violations, they can at least be seen as heuristics for bringing more precision and 
structure when building a KB. The authors of WordNet 1.3 were sent the list of the 230 
detected possible problems. Most of these possible problems do not occur anymore in 
the current WordNet (3.1). 


• Many of the 230 possible problems were detected via the added exclusion relations
between the top-level category for actions and other top-level categories which
seemed exclusive with it, based on their names, their informal definitions and those
of their specializations. Via the expression “informal definition” we refer to the
description in natural language that each WordNet category has. Via the expression
“categorized as” we refer to the generalization relations that an object has in
WordNet. The above mentioned added exclusion relations led to the discovery of
categories (e.g. those for some of the meanings of the words “epilogue” and
“interpretation”) which were (i) categorized and informally defined as action
results/attributes/descriptions, (ii) seemingly exclusive with actions (given how they
were informally defined and given they were not also informally defined as actions),
and (iii) (rather surprisingly) also categorized as actions. Given these last three
points, [11] removed the “categorization as action” of these action result/attribute/
description categories. Based on the content of WordNet 3.1, it appears that the
authors of WordNet then also made this removal.

• Other causes for the 230 violations detected via the added exclusion relations
between top-level categories came from the fact that WordNet uses generalization
relations between categories instead of other relations. E.g., instead of
location/place relations: in WordNet 1.3, many categories informally defined as
battles were classified as both battles and cities/regions (this is no longer the case in
WordNet 3.1). E.g., instead of member relations: in WordNet, the classification of
species is often intertwined with the classification of genera of species.


Several research works in knowledge acquisition, model-driven engineering or 
ontology engineering, e.g. [12-15], have advocated the use of tree structures when 
designing a subtype hierarchy, hence the use of (i) single inheritance only, and 
(ii) multiple tree structures, e.g. one per view or viewpoint. They argue that each object 
of the KB has a unique place in such trees and thus that such trees can be used as 
decision trees or ways to avoid redundancies, normalize KBs and ease KB 
searching/handling. This is true but the same advantages can be obtained by creating 
subtypes solely via sets of disjoint (direct) subtypes. Indeed, to keep these advantages, 
it is sufficient (and necessary) that whenever two types are disjoint, this disjointness is 
specified. With tree structures, there are no explicit disjointWith relations but the 
disjointness is still (implicitly) specified. Compared to the use of multiple tree struc- 
tures, the use of disjoint subtypes and multiple inheritance has the advantages of (i) not 
requiring a special inference engine to handle “tree structures with bridges between 
them” (e.g. those of [12, 16]) instead of a classic ontology, and (ii) generally requiring 
less work for knowledge providers than creating and managing many tree structures 
with bridges between them. Furthermore, when subtype partitions can be used, the 
completeness of these sets supports additional inferences for checking or reasoning 
purposes. The above rationale does not imply that views or tree structures are not
interesting; it only implies that sets of disjoint subtypes are good alternatives when
they can be used instead.

Methods or patterns to fix (particular kinds of) detected conflicts are not within the
scope of this article. Such methods are for example studied in belief set/base
revision/contraction as well as in KB debugging. [17] proposes an adaptation of base
revision/debugging for OWL-like KBs. The authors of [18] have created ontology 
design patterns that propose systematic ways to resolve some particular kinds of 
inconsistencies, especially the violation of exclusion relations. 

As illustrated in Sect. 2.1, the OWL properties usable to express that some types
are “comparable or uncomparable via subtypeOf” (e.g. subClassOf, subPropertyOf,
equivalentClass, equivalentProperty, disjointWith and complementOf relations) can
be combined to define or declare properties for creating (in-)complete sets of (non-)
disjoint subtypes or, more generally, for creating more precise relations which better
support the detection of inconsistencies or redundancies. E.g., sub:proper-subClassOf
can be defined and used to prevent unintended subClassOf cycles.


Advantages for Knowledge Querying. Alone, subtypeOf_or_equivalent relations 
only support the search for specializations (or generalizations) of a query statement, i.e. 
the search for objects comparable (via subtype relations) to the query parameter. The 
search for objects “not uncomparable via specialization” to the query parameter — i.e. 
objects that are or could be specializations or generalizations of this parameter — is 
more general and sometimes useful. 


• Assume that a KB user is searching for lodging descriptions in a KB where sports
halls are not categorized as lodgings but are not exclusive with them either, based
on the fact that they are not regular lodgings but that they can be used as such when
natural disasters occur. Also assume that the user intuitively shares such views on
lodgings and sports halls. Then, querying the KB for (specializations of) “lodgings”
will not retrieve sports halls. On the other hand, querying for objects not
uncomparable to “lodgings” will return sports halls; furthermore, if lodgings have been
defined as covered areas, such a query will not return uncovered areas such as open
stadiums. Thus, assuming that the term “lodging” in this query has
been used because the author of the query was looking for covered areas only, this
person will only get potentially relevant results.

• More generally, when a person does not know which exact type to use in a query or
does not know what kind of query to use (e.g. a query for the specializations or the
generalizations of the query parameter), a query for objects “not uncomparable” to
the query parameter may well collect all and only the objects the person is interested
in, if in the KB all or most types are either comparable or uncomparable via subtype
relations.
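
The lodging example can be sketched in Python as follows. The class names, the dictionaries and the function names are hypothetical and only serve the illustration. For brevity, only strong uncomparability is modelled: a disjointness declared between two classes is taken to hold between their subclasses too.

```python
# Illustrative sketch only, with hypothetical class names.

SUPER = {                            # class -> its direct superclasses
    "Hotel": {"Lodging"},
    "Lodging": {"Covered_area"},
    "Sports_hall": {"Covered_area"},
    "Open_stadium": {"Uncovered_area"},
}
DISJOINT = {frozenset({"Covered_area", "Uncovered_area"})}
CLASSES = {"Hotel", "Lodging", "Sports_hall", "Open_stadium",
           "Covered_area", "Uncovered_area"}

def up(cls):
    """The class itself plus all its direct and indirect superclasses."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(SUPER.get(c, ()))
    return seen

def specializations(query):
    """Classic query: classes that are (indirect) subclasses of the parameter."""
    return {c for c in CLASSES if c != query and query in up(c)}

def uncomparable(a, b):
    """True if a disjointness is declared between a and b or between their superclasses."""
    return any(frozenset({x, y}) in DISJOINT for x in up(a) for y in up(b))

def not_uncomparable(query):
    """More general query: classes that are or could be related to the parameter."""
    return {c for c in CLASSES if c != query and not uncomparable(query, c)}

print(sorted(specializations("Lodging")))
print(sorted(not_uncomparable("Lodging")))
```

The first query only returns Hotel, whereas the second also returns Sports_hall and Covered_area while still leaving out Open_stadium and Uncovered_area.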


The more systematically the types of a KB are comparable via subtype relations, the
more the statements of the KB (as well as other objects, if they have a definition) will be
retrievable via comparability or uncomparability based queries.


3 Other Interesting Cases of Comparability 


The previous section was about the comparability of types via subtype relations. This
section generalizes the approach to other types of relations.


3.1 Comparability via “Definition Element” Relations 


In this article, an object definition is a logic formula that all specializations of the object
must satisfy. A full definition specifies necessary and sufficient conditions that the
specializations must satisfy. In OWL, a full definition of a class is made by relating this
class to a class expression via an owl:equivalentClass relation. Specifying only
necessary conditions (e.g. using rdfs:subClassOf instead of owl:equivalentClass)
means making only a partial definition. An “element of a definition” is
any target domain object which is a member of that definition, except for objects of the
used language (e.g. quantifiers and logical operators). A “definition element” relation is
one that connects the defined object to an element of the definition. E.g., if a Triangle is
defined as a “Polygon that has as part 3 Edges and 3 Vertices”, Triangle has as
definition elements the types Polygon, Edge, Vertex and part as well as the value 3. The
property sub:definition_element, one of the types that we propose, is the
type of all “definition element” relations that can occur with OWL-based definitions.
We have fully defined sub:definition_element in [8] based on the various ways
definitions can be made in OWL; one of its subtypes is rdfs:subClassOf. This
subsection generalizes Sect. 2 since a definition may specify other relations than
subClassOf relations, as illustrated by the above definition of Triangle. A
“definition-element exclusion” relation is one that connects an object O to another one that could
not be used for defining O. This property can be defined based on the “definition
element” relation type. E.g.:


sub:definition-element_exclusion   #reminder: "has _" is implicit
    rdfs:subPropertyOf owl:differentFrom ;
    owl:propertyDisjointWith sub:definition_element ;
    owl:propertyDisjointWith [owl:inverseOf sub:definition_element] .


As explained in Sect. 2.3, checking that types in a KB are either comparable or
uncomparable via subtype relations reduces implicit redundancies between type
hierarchies. As illustrated by the later paragraph titled “Example of implicit potential
redundancies”, this checking is not sufficient for finding every implicit potential 
redundancy resulting from a lack of definition, hence for finding every specialization 
hierarchy that could be derived from another one in the KB if particular definitions 
were given. However, this new goal can be achieved by generalizing the previous 
approach since this goal implies that for every pair of objects (in the KB or a selected 
KB subset), either one of these objects is defined using the other or none can be defined 
using the other. In other words, this goal means checking that for every pair of objects 
in the selected set, these two objects are either comparable or uncomparable via 
“definition element” relations. To express that objects are strongly uncomparable in this 
way — and hence not potentially redundant — “definition-element exclusion” relations 
can be used. 

The above cited new goal implies that, from every object, every other object in the 
KB is made comparable or uncomparable via “definition element” relations. This is an 
enormous job for a KB author and very few current KBs would satisfy this ODR. 
However, given particular reasons and techniques described in [8], a KB 
contributor/evaluator may choose to assume that for avoiding a good enough amount of 
implicit potential redundancies between type hierarchies, it is sufficient to check that 
from every object, at least one other object in the KB is made comparable or 
uncomparable via “definition element” relations (thus, using the “some other object” 
option given in Fig. 1, instead of the “every other object” option). As explained in [8],
this saves the KB contributors a lot of work and may avoid generating a large
number of “definition-element exclusion” relations.


Example of Implicit Potential Redundancies. It is often tempting to specialize 
particular types of processes or types of physical entities according to particular types 
of attributes, without explicitly declaring these types of attributes and organizing them 
by specialization relations. E.g., at first thought, it may sound reasonable to declare a
process type Fair_process without relating it to an attribute type Fairness (or Fair) via a
definition such as “any Fair_process has as attribute a Fairness”. However,
Fair_process may then be specialized by types such as Fair_process_for_utilitarianism,
Fair_process_wrt_Pareto-efficiency, Fair_bargaining, Fair_distribution,
Fair_distribution_wrt_utilitarianism, Fair_distribution_for_prioritarianism,
Fair_distribution_wrt_Pareto-efficiency, etc. It soon becomes apparent that this approach is not relevant
since (i) every process type can be specialized wrt. a particular attribute type or any 
combination of particular attribute types, and (ii) similar specializations can also be 
made for function types (e.g. starting from Fair_function) and attribute types (starting 
from Fairness). Even if the KB is not a large KB shared by many persons, many 
beginnings of such parallel categorizations may happen, without them being related via 
definitions. Indeed, the above example with process types and attribute relations to
attribute types can be replicated with any type and any relation type, e.g. with process
types and agent/object/instrument/time relation types or with physical entity types and 
mass/color/age/place relation types. 

Ensuring that objects are either comparable or uncomparable via “definition ele- 
ment” relations is a way to prevent such (beginnings of) implicitly potentially redun- 
dant type hierarchies: all/most/many of them depending on the chosen option and 
assumption. As with disjointWith relations, the most useful “definition-element 
exclusion” relations are those between some top-level types. To normalize definitions 
in the KB, e.g. to ease logical inferencing, a KB owner may also use “definition- 
element exclusion” relations to forbid particular kinds of definitions, e.g. forbid pro- 
cesses to be defined wrt. attributes or physical entities. Each definition for a type T sets 
“definition element” relations to other types, and these relations also apply to the 
subtypes of T. A special “definition element” relation type may also be used to reach 
not just the above cited other types but their subtypes too. Otherwise, most types would 
need to be defined if few “definition-element exclusion” relations are set between top- 
level types. 


3.2 Comparability via Other Transitive Relations, Especially Part 
Relations 


Ensuring that objects are either comparable or uncomparable via specialization
relations has many advantages which were illustrated in
Sects. 2.3 and 3.1. Similar advantages exist with all transitive relations, not just
specialization relations, although to a lesser extent since fewer inferences (and hence less
error detection) can be made with other transitive relations.

Part properties (e.g. for spatial parts, temporal parts or sub-processes) are
partial-order properties that are often exploited. Unlike subtype relations, they connect
individuals. Nevertheless, for checking the “comparability of individuals via part
relations (let us assume sub:part relations)”, the SPARQL query given in Sect. 2.2
can be adapted. Below is this adapted query for the “each other object” choice. The
adaptation to make for the “some other object” choice is similar to the one in Sect. 2.2.
Two objects are “comparable via part relations” if one is fully part of the other (or
if they are identical). They are “strongly uncomparable via part relations” if they do not
share any part (and hence the respective parts of these two objects do not have shared
parts either). Two objects that are “weakly uncomparable via part relations” share some
parts but none is fully part of the other.


SELECT distinct ?i1 ?i2 WHERE  #individuals (as checked by the next 2 lines)
{ ?i1 rdf:type ?c1. FILTER NOT EXISTS { ?i1 rdf:type owl:Class }
  ?i2 rdf:type ?c2. FILTER NOT EXISTS { ?i2 rdf:type owl:Class }

  #skip comparable objects:
  FILTER NOT EXISTS { ?i1 owl:sameAs|sub:part+|(^sub:part)+ ?i2 }

  #skip strongly uncomparable objects:
  FILTER NOT EXISTS { ?i1 sub:part_exclusion ?i2 }

  #skip remaining uncomparable objects that are only weakly uncomparable:
  FILTER NOT EXISTS { ?i1 owl:differentFrom ?i2 }  #as in Section 2.2

} #with: sub:part rdfs:subPropertyOf owl:differentFrom ;
  #               rdf:type owl:TransitiveProperty .
  #      sub:part_exclusion rdfs:subPropertyOf owl:differentFrom ;
  #               owl:propertyDisjointWith sub:part .
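The three cases defined above (comparable, strongly uncomparable, weakly uncomparable) can be illustrated with a small Python sketch. It makes a simplifying assumption that the article does not make: each individual is described extensionally by the set of its atomic parts, so that "fully part of" becomes set inclusion. All individual and part names are hypothetical.

```python
# Hypothetical individuals, each mapped to the set of its atomic parts.
PARTS = {
    "House": {"roof", "wall", "door"},
    "Roof_and_wall": {"roof", "wall"},
    "Garden": {"tree", "lawn"},
    "Door_and_lawn": {"door", "lawn"},
}


def part_comparability(a, b):
    """Classify two individuals according to the part relations between them."""
    parts_a, parts_b = PARTS[a], PARTS[b]
    if parts_a <= parts_b or parts_b <= parts_a:
        return "comparable"             # one is fully part of the other (or identical)
    if parts_a.isdisjoint(parts_b):
        return "strongly uncomparable"  # no shared part at all
    return "weakly uncomparable"        # some shared parts, no full inclusion


print(part_comparability("House", "Roof_and_wall"))  # comparable
print(part_comparability("House", "Garden"))         # strongly uncomparable
print(part_comparability("House", "Door_and_lawn"))  # weakly uncomparable
```

In the SPARQL query above, the same distinctions are carried by explicit sub:part, sub:part_exclusion and owl:differentFrom relations rather than by computed part sets.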


3.3 Comparability via Transitive Relations Plus Minimal Differentia 


When defining a type, a good practice is to specify (i) its similarities and differences 
with each of its direct supertypes (e.g., as in the genus & differentia design pattern), and 
(ii) its similarities and differences with each of its siblings for these supertypes. This is 
an often advocated best practice to improve the understandability of a type, as well as 
enabling more inferences. E.g., this is the “Differential Semantics” methodology of 
[13]. Several ODRs can be derived from this best practice, depending on how “dif- 
ference” is defined. In this article, the term “minimal-differentia” refers to a difference 
of at least one (inferred or not) relation in the compared type definitions: one more 
relation, one less or one with a type or destination that is different (semantically, not 
just syntactically). Furthermore, to check that a class is different from each of its 
superclasses (i.e. to extend the genus & differentia method), an rdfs:subClassOf 
relation between the two classes does not count as “differing relation”. When relevant, 
this ODR can be generalized to use other transitive relations between objects, e.g. 
partOf relations. 

For the “comparability relation type”, Fig. 1 proposes the option
“comparability-or-uncomparability_with-minimal-differentia”. For supporting this option when checking
“comparability via subClassOf relations” between any pair of classes in a KB, the code
of the SPARQL query of Sect. 2.2 can be adapted by adding some lines before the
filters testing whether the classes are comparable or uncomparable: below, see the
FILTER block from the 3rd line to the “...”. This block checks that there is a
“minimal-differentia” between the tested classes. The retrieval of automatically inferred
relations relies on the use of a relevant entailment regime.


SELECT distinct ?c1 ?c2 WHERE
{ ?c1 rdf:type owl:Class. ?c2 rdf:type owl:Class. FILTER (?c1 != ?c2)
  FILTER (! #skip classes satisfying the following conditions:
    ( (EXISTS
        { ?c1 ?p1 ?v1 .                        #?c1 has at least one property
          FILTER(?p1!=rdfs:subClassOf)         # that is not rdfs:subClassOf
          FILTER                               #and
            (    NOT EXISTS { ?c2 ?p1 ?v2 }    # ?p1 is not in ?c2
              || EXISTS { ?c2 ?p1 ?v2          #or ?p1 ?v1 is not in ?c2
                          FILTER (?v1 != ?v2) }
              || EXISTS { ?c2 ?p2 ?v2          #or ?p2 ?v2 is not in ?c1
                          FILTER NOT EXISTS { ?c1 ?p2 ?v2 } })
        })
      || ((NOT EXISTS                          #or
            { ?c1 ?p1 ?v1                      # ?c1 has no property, except may be
              FILTER ((?p1!=rdfs:subClassOf)   # an rdfs:subClassOf property
                      && (?v1 != ?c2))}        # to ?c2
          ) && EXISTS{?c2 ?p2 ?v2})            # and ?c2 has (other) properties
    ))
  ... #same filtering for (un-)comparable objects as in Section 2.2
}
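The idea of a “minimal-differentia” can also be sketched outside SPARQL. The Python function below is a simplified reading of the definition given above, not a translation of the query: two classes have a minimal differentia if their sets of (property, value) relations differ, an rdfs:subClassOf link between the two classes being ignored. It assumes that the triples already include the inferred relations; the class and property names are hypothetical.

```python
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical KB, assumed to be already closed under the chosen entailment regime.
TRIPLES = {
    ("ex:Triangle", SUBCLASS_OF, "ex:Polygon"),
    ("ex:Triangle", "ex:edgeCount", 3),
    ("ex:Trigon", SUBCLASS_OF, "ex:Polygon"),  # no differentia given for ex:Trigon
}


def relations(cls, other, triples):
    """(property, value) pairs of cls, ignoring a subClassOf link to `other`."""
    return {(p, v) for (s, p, v) in triples
            if s == cls and not (p == SUBCLASS_OF and v == other)}


def has_minimal_differentia(c1, c2, triples):
    """True if c1 and c2 differ by at least one relation (one more, one less
    or one with a different property or destination)."""
    return relations(c1, c2, triples) != relations(c2, c1, triples)


print(has_minimal_differentia("ex:Triangle", "ex:Polygon", TRIPLES))  # True
print(has_minimal_differentia("ex:Trigon", "ex:Polygon", TRIPLES))    # False
print(has_minimal_differentia("ex:Triangle", "ex:Trigon", TRIPLES))   # True
```

Here ex:Trigon is reported because nothing other than its subClassOf link distinguishes it from ex:Polygon.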


4 Other Comparisons with Other Works and Conclusion 


As previously illustrated, the “comparability ODR” generalizes — or permits one to 
generalize — some best practices, ontology patterns or methodologies that advocate the 
use of particular relations between particular objects, and supports an automated 
checking of the compliance with these practices, patterns or methodologies. This leads 
to the representation of knowledge that is more connected and precise, or with fewer
redundancies. Since the comparability ODR can be used for evaluating a KB — e.g. by 
applying it to all its objects and dividing the number of successful cases by the number 
of objects — it can also be used to create KB evaluation criteria/measures, typically for 
measuring the (degree of) completeness of a KB, with respect to some criteria. 

As noted in [7], a survey on quality assessment for Linked Data, completeness 
commonly refers to a degree to which the “information required to satisfy some given 
criteria or a given query” are present in the considered dataset. To complement this 
very general definition, we distinguish two kinds of completeness. 


• Constraint-based completeness measures the percentage of elements in a dataset
that satisfy explicit representations of what (or how) information must be
represented in the dataset. These representations are constraints such as integrity
constraints or, more generally, constraints expressed by database schemas,
structured document schemas, or schemas enforcing ontology design patterns. E.g., in a
particular dataset, the constraint that at least one movie must be associated to each
movie actor, or the constraint that all relations must be binary.
• Real-world-based completeness measures the degree to which particular kinds of
real-world information are represented in the dataset. E.g., regarding movies
associated to an actor, calculating this completeness may consist in dividing “the
number of movies associated to this actor in the dataset” by “the number of movies
he actually played in, i.e. in the real world”. Either the missing information is
found in a gold standard dataset or the degree is estimated via completeness oracles
[19], i.e. rules or queries estimating what is missing in the dataset to answer a given
query correctly. Tools such as SWIQA and Sieve help perform measures for this
kind of completeness.
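The two kinds of completeness can be contrasted with a minimal Python sketch based on the movie example. The function names, the data and the convention of returning 1.0 for an empty input are illustrative assumptions, and both functions return a ratio in [0, 1] rather than a percentage.

```python
def constraint_based_completeness(elements, constraint):
    """Share of the elements that satisfy an explicit constraint."""
    elements = list(elements)
    if not elements:
        return 1.0  # assumption: an empty dataset violates nothing
    return sum(1 for e in elements if constraint(e)) / len(elements)


def real_world_completeness(in_dataset, gold_standard):
    """Share of the gold-standard items that are present in the dataset."""
    gold = set(gold_standard)
    if not gold:
        return 1.0  # assumption: nothing is missing if nothing is expected
    return len(set(in_dataset) & gold) / len(gold)


# Hypothetical dataset: actor -> movies recorded for this actor.
movies_of = {"actor_1": {"m1", "m2"}, "actor_2": set(), "actor_3": {"m3"}}

# Constraint: at least one movie must be associated to each movie actor.
ratio_constraint = constraint_based_completeness(
    movies_of, lambda actor: len(movies_of[actor]) >= 1)

# Gold standard (hypothetical): actor_1 actually played in four movies.
ratio_real_world = real_world_completeness(
    movies_of["actor_1"], {"m1", "m2", "m4", "m5"})

print(ratio_constraint)   # 2 of the 3 actors satisfy the constraint
print(ratio_real_world)   # 2 of the 4 gold-standard movies are present
```

Only the second measure needs an external gold standard; the first one can be evaluated from the dataset and the constraint alone.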


All the completeness criteria/measures collected by [7] — schema/property/ 
population/interlinking completeness — “assume that a gold standard dataset is avail- 
able”. Hence, they are all subkinds of real-world based completeness. However, 
constraint-based completeness is equally interesting and, for its subkinds, categories 
named schema/property/population/interlinking completeness could also be used or 
have been used [1, 4]. The comparability ODR can be reused to ease the
measurement of constraint-based completeness. As illustrated in this
article, checking such a completeness may lead the KB authors to represent information
that increases the KB precision and then enables the finding of yet-undetected problems.
Increasing such a completeness does not mean increasing inferencing speed.

This article showed how SPARQL queries could be used for implementing com- 
parability ODRs. More generally, most transformation languages or systems that 
exploit KRs could be similarly reused. [20] and [21] present such systems. The pro- 
posed SPARQL queries have been validated experimentally (using Corese [21], a tool 
which includes an OWL-2 inference engine and a SPARQL engine). Unsurprisingly, in 
the tested existing ontologies, many objects were not compliant with the ODRs.


Abstract. While the multilingual data on the Semantic Web grows
rapidly, the building of multilingual ontologies from monolingual ones is
still cumbersome and hampered by the lack of techniques for
cross-lingual ontology enrichment. Cross-lingual ontology enrichment greatly
facilitates the semantic interoperability between different ontologies in
different natural languages. Achieving such enrichment by human labor
is a time-consuming and error-prone task. Thus, in this paper, we
propose a fully automated ontology enrichment approach using cross-lingual
matching (OECM), which builds a multilingual ontology by
enriching a monolingual ontology from another one in a different natural
language using cross-lingual matching. OECM selects the best
translation among all available translations of ontology concepts based on
their semantic similarity with the target ontology concepts. We present
a use case of our approach for enriching English Scholarly
Communication Ontologies using German and Arabic ontologies from the
MultiFarm benchmark. We have compared our results with the results from
the Ontology Alignment Evaluation Initiative (OAEI 2018). Our
approach has higher precision and recall in comparison to five
state-of-the-art approaches. Additionally, we recommend some linguistic corrections
in the Arabic ontologies in MultiFarm which have enhanced our
cross-lingual matching results.


Keywords: Cross-lingual ontology enrichment · Cross-lingual
matching · Multilingual ontology · Ontology engineering · Knowledge
management


1 Introduction 


The wide proliferation of multilingual data on the Semantic Web results in many
ontologies scattered across the web in various natural languages. According to
the Linked Open Vocabularies (LOV)¹, the majority of the ontologies in the
Semantic Web are in English; however, ontologies in other Indo-European
languages also exist. For instance, out of a total of 681 vocabularies found in LOV,
500 are in English, 54 in French, 39 in Spanish, and 33 in German. Few
ontologies exist in non-Indo-European languages, such as 13 in Japanese and seven in
Arabic. Monolingual ontologies with labels or local names presented in a certain
language are not easily understandable to speakers of other languages. Therefore,
in order to enhance semantic interoperability between monolingual ontologies,
approaches for building multilingual ontologies from the existing monolingual
ones should be developed [26]. Multilingual ontologies can be built by applying
cross-lingual ontology enrichment techniques, which expand the target
ontology with additional concepts and semantic relations extracted from external
resources in other natural languages [23]. For example, suppose we have two
ontologies: the Scientific Events Ontology in English ($SEO_{en}$) and Conference in
German ($Conference_{de}$). $SEO_{en}$ and $Conference_{de}$ have complementary
information, i.e. $SEO_{en}$ has some information which does not exist in $Conference_{de}$
and vice versa. Let us consider a scenario where a user wants to get information
from both $SEO_{en}$ and $Conference_{de}$ to be used in an ontology-based application.
This may not be possible without a cross-lingual ontology enrichment solution,
which enriches the former with the complementary information in the latter. Manual
ontology enrichment is a resource-demanding and time-consuming task.
Therefore, fully automated cross-lingual ontology enrichment approaches are highly
desired [23]. Most of the existing work in ontology enrichment focuses on enriching
English ontologies from English sources only (monolingual enrichment) [23]. To
the best of our knowledge, only our previous work [1,14] has addressed the
cross-lingual ontology enrichment problem by proposing a semi-automated approach
to enrich ontologies from multilingual text or from other ontologies in different
natural languages.

In this paper we address the following research question: how can we
automatically build multilingual ontologies from monolingual ones? We propose a
fully automated ontology enrichment approach in order to create multilingual 
ontologies from monolingual ones using cross-lingual matching. We extend our 
previous work [14] by: (1) using the semantic similarity to select the best trans- 
lation of class labels, (2) enriching the target ontology by adding new classes 
in addition to all their related subclasses in the hierarchy, (3) using ontologies 
in non-Indo-European languages (e.g., Arabic), as the source of information, (4) 
building multilingual ontologies, and (5) developing a fully automated approach. 
OECM comprises six phases: (1) translation: translate class labels of the source 
ontology, (2) pre-processing: process class labels of the target and the trans- 
lated source ontologies, (3) terminological matching: identify potential matches 
between class labels of the source and the target ontologies, (4) triple retrieval: 
retrieve the new information to be added to the target ontology, (5) enrichment: 
enrich the target ontology with new information extracted from the source ontol- 
ogy, and (6) validation: validate the enriched ontology. A noticeable feature of 


1 https://lov.linkeddata.es/dataset/lov/vocabs.


From Monolingual to Multilingual Ontologies


OECM is that we consider multiple translations for a class label. In addition, the
use of semantic similarity has significantly improved the quality of the
matching process. We present a use case for enriching the Scientific Events Ontology
(SEO) [9], a scholarly communication ontology for describing scientific events,
from German and Arabic ontologies. We compare OECM to five state-of-the-art
approaches for the cross-lingual ontology matching task. OECM outperformed these
approaches in terms of precision, recall, and F-measure. Furthermore, we
evaluate the enriched ontology by comparing it against a gold standard created by
ontology experts. The implementation of OECM and the datasets used in the
use case are publicly available².

The remainder of this paper is structured as follows: we present an overview of 
related work in Sect. 2. Overview of the proposed approach is described in Sect. 3. 
In order to illustrate possible applications of OECM, a use case is presented in 
Sect. 4. Experiments and evaluation results are presented in Sect. 5. Finally, we 
conclude with an outline of the future work in Sect. 6. 


2 Related Work 


A recent review of the literature on multilingual Web of Data found that the 
potential of the Semantic Web for being multilingual can be accomplished by 
techniques to build multilingual ontologies from monolingual ones [12]. Multi- 
lingual enrichment approaches are used to build multilingual ontologies from 
different resources in different natural languages [5,6,24]. Espinoza et al. [6] have
proposed an approach to generate multilingual ontologies by enriching the exist- 
ing monolingual ontologies with multilingual information in order to translate 
these ontologies to a particular language and culture (ontology localization). In 
fact, ontology enrichment depends on matching the target ontology with external 
resources, in order to provide the target ontology with additional information 
extracted from the external resources. 

The literature has focused on cross-lingual ontology matching techniques,
which are used for matching linguistic information in ontologies across different
natural languages [12,26]. Meilicke et al. [20] created a benchmark dataset
(MultiFarm) that results from the manual translations of a set of ontologies 
from the conference domain into eight natural languages. This dataset is widely 
used to evaluate the cross-lingual matching approaches [7,15,16,28]. Manual 
translation of ontologies can be infeasible when dealing with large and com- 
plex ontologies. Trojahn et al. [27] proposed a generic approach which relies on 
translating concepts of source ontologies using machine translation techniques 
into the language of the target ontology. In the translation step, they depend 
on getting one translation for each concept (one-to-one translation), then they 
apply monolingual matching approaches to match concepts between the source 
ontologies and the translated ones. Fu et al. [10,11] proposed an approach to 
match English and Chinese ontologies by considering the semantics of the tar- 
get ontology, the mapping intent, the operating domain, the time and resource 


2 https://github.com/shmkhaled/OECM.
constraints and user feedback. Hertling and Paulheim [13] proposed an
approach which utilizes Wikipedia’s inter-language links for finding corresponding
ontology elements. Lin and Krizhanovsky [18] proposed an approach which uses
Wiktionary³ as a source of background knowledge to match English and French
ontologies. Tigrine et al. [25] presented an approach, which relies on the
multilingual semantic network BabelNet⁴ as a source of background knowledge, to match
several ontologies in different natural languages. In the context of the OAEI 2018
campaign⁵ for evaluating ontology matching technologies, AML [7], KEPLER
[16], LogMap [15] and XMap [28] provide high-quality alignments. These
systems use terminological and structural alignments in addition to external
lexicons, such as WordNet⁶ and UMLS-lexicon⁷, in order to get the set of
synonyms for the ontology elements. In order to deal with multilingualism, AML
and KEPLER rely on one-to-one translation using machine translation
technologies, such as Microsoft Translator, before starting the matching process.
LogMap and XMap do not provide any information about the utilized
translation methodology. Moreover, LogMap is an iterative process that starts from
initial mappings (‘almost exact’ lexical correspondences) to discover new
mappings. It is mentioned in [15] that the main weakness of LogMap is that it
cannot find matches between ontologies which do not provide enough lexical
information, as it depends mainly on the initial mappings. A good review of the
state-of-the-art approaches in cross-lingual ontology matching is provided in [26].

Most of the literature has focused on enriching monolingual ontologies with
multilingual information in order to translate or localize these ontologies. In
addition, in the cross-lingual ontology matching task, there is a lack of exact
one-to-one translation between terms across different natural languages, which
negatively affects the matching results. We address these limitations in our proposed
approach by building multilingual ontologies, where a class label is presented in
several natural languages, from monolingual ones. Such an approach supports the
ontology matching process with multiple translations for a class label in order
to enhance the matching results.


3 The Proposed Approach 


Goal: Given two ontologies $S$ and $T$, in two different natural languages $L_S$ and
$L_T$ respectively, as RDF triples $(s, p, o) \in C \times R \times (C \cup \mathcal{L})$ where $C$ is the set of
ontology domain entities (i.e. classes), $R$ is the set of relations, and $\mathcal{L}$ is the set
of literals. We aim at finding the complementary information $T_e = S - (S \cap T)$
from $S$ in order to enrich $T$.
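As a minimal sketch of this goal, the complementary information can be computed with ordinary set operations once both ontologies are available as sets of triples. The sketch assumes that the class labels of $S$ have already been translated and matched, so that equal triples compare equal; the triples and the function name are hypothetical.

```python
def complementary_triples(source, target):
    """Return T_e = S - (S ∩ T): the triples of S that T does not contain."""
    return source - (source & target)


# Hypothetical triples (subject, predicate, object) after translation and matching.
S = {
    ("Conference", "rdfs:subClassOf", "Event"),
    ("Workshop", "rdfs:subClassOf", "Event"),
}
T = {
    ("Conference", "rdfs:subClassOf", "Event"),
}

T_e = complementary_triples(S, T)
print(T_e)  # {('Workshop', 'rdfs:subClassOf', 'Event')}
```

The phases described next are what make such a comparison meaningful across two natural languages.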

The proposed approach comprises six phases (Fig. 1): translation,
pre-processing, terminological matching, triple retrieval, enrichment, and validation.


3 https://www.wiktionary.org/.
4 https://babelnet.org/.
5 http://oaei.ontologymatching.org/2018/results/multifarm/index.html.
6 https://wordnet.princeton.edu/.
7 https://www.nlm.nih.gov/research/umls/.

Fig. 1. The workflow of OECM.


The input is the two ontologies in two different natural languages, i.e. the target
ontology $T$ and the source ontology $S$. The output is the multilingual enriched
ontology $T_{enriched}$ in two different natural languages $L_1$ and $L_2$. In the following
subsections, we describe each of these phases in detail.


3.1 Translation 


Let $C_S$ and $C_T$ be the set of classes in $S$ and $T$ respectively. Each class is
represented by a label or a local name. The aim of this phase is to translate each
class in $C_S$ to the language of $T$ (i.e. $L_T$). Google Translator⁸ is used to translate
classes of source ontologies. All available translations are considered for each
class. Therefore, the output of the translation is $C_{S-translated}$, which has each
class in $S$ associated with a list of all available translations. For example, the
class Thema in German has a list of English translations (Subject and Topic),


and the class label مراجعة in Arabic has a list of English translations such as


“Review, Revision, Check”. The best translation will be selected in the termino- 
logical matching phase (Subsect. 3.3). 


3.2 Pre-processing 


The aim of this phase is to process classes of $C_T$ and lists of translations in
$C_{S-translated}$ by employing a variety of natural language processing (NLP)
techniques, such as tokenization, POS-tagging (part-of-speech tagging), and
lemmatization, to make them ready for the next phases. In order to enhance the similarity


8 https://translate.google.com/.

Fig. 2. Illustration of terminological matching between the lists of English translations
for every concept in $C'_{S-translated}$ (in Arabic) and $C'_T$ (in English)


results between $C_T$ and $C_{S-translated}$, stop words are removed and normalization
methods and regular expressions are used to remove punctuation, symbols and
additional white spaces, and to normalize the structure of strings. Furthermore, our
pre-processing is capable of recognizing camel-case class names such as
“ReviewArticle” and adds a space between lower-case and upper-case letters, giving
“Review Article” (i.e. true casing technique). The output of this phase is $C'_T$, which has the
pre-processed class labels of $T$, and $C'_{S-translated}$, which has the pre-processed
translations for each class in $S$.
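A rough sketch of part of this pre-processing, using only the Python standard library, is given below. It is an illustration rather than the OECM implementation: the stop-word list is a small hypothetical sample, lower-casing is an added assumption, labels are assumed to be in English, and POS-tagging and lemmatization are left out.

```python
import re

# Small hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "or", "to"}


def preprocess_label(label):
    """Normalize a class label: split camel case, drop symbols and stop words."""
    # Insert a space between a lower-case letter and a following upper-case letter.
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    # Replace punctuation and symbols (including underscores) by spaces.
    label = re.sub(r"[^A-Za-z0-9\s]", " ", label)
    # Lower-case, collapse additional white spaces and remove stop words.
    tokens = [t for t in label.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess_label("ReviewArticle"))                # review article
print(preprocess_label("  Chair_of_the  Committee! "))  # chair committee
```

Applying the same function to the target labels and to every translation of a source class keeps the two sides comparable in the matching phase.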


3.3 Terminological Matching 


The aim of this phase is to identify potential matches between class labels of $S$
and $T$. We perform a pairwise lexical and/or semantic similarity between the list
of translations of each class in $C'_{S-translated}$ and $C'_T$ to select the best translation
for each class in $S$ that matches the corresponding class in $T$ (see Algorithm 1).
Jaccard similarity [22] is used to filter the identical concepts instead of using
semantic similarity from the beginning because there is no need for extra
computations to compute semantic similarity between two identical classes. The
reason behind choosing the Jaccard similarity is that according to the
experiments conducted for the ontology alignment task for the MultiFarm benchmark
in [2], Jaccard similarity has achieved the best score in terms of precision. For
non-identical concepts, we compute the semantic similarity using the path length
measure, based on WordNet, which scores two words according to the shortest path
between them in the WordNet hierarchy [3]. If two words are semantically equivalent, i.e.,
belong to the same WordNet synset, the path similarity is 1.00. We use a specific
threshold $\theta$ in order to get the set of matched terms (matched classes) $M$. We
obtained the best value of $\theta = 0.9$, which has the best matching results after
running the experiments ten times. If no match is found, we consider this
class as a new class that can be added to $T$ and we consider its list of
translations as synonyms for that class. Generally, class labels have more than one

Algorithm 1. Terminological Matching

   Data: C'_{S-translated}, C'_T, θ similarity threshold
   Result: M matched terms, C'_{S-translated}
 1 foreach c_s ∈ C'_{S-translated}, t ∈ listOfTranslations, c_t ∈ C'_T do
 2     similarityScore ← getSimilarity(t, c_t)
 3     if similarityScore > θ then
 4         M ::= (t, c_t)
 5         C'_{S-translated} = update(C'_{S-translated})
 6 Function getSimilarity(sentence1, sentence2): double
 7     similarity ← getJaccardSimilarity(sentence1, sentence2)
 8     if similarity ≠ 1 then
 9         similarity ← (sentenceSimilarity(sentence1, sentence2)
10                       + sentenceSimilarity(sentence2, sentence1)) / 2
11     return similarity
12 Function sentenceSimilarity(sentence1, sentence2): double
13     simScore ← 0.0
14     count ← 0.0
15     foreach w_i ∈ sentence1.split(" ") do
16         foreach w_j ∈ sentence2.split(" ") do
17             pathSim ::= getPathSimilarity(w_i, w_j)
18         simScore += pathSim.max
19         count += 1
20     simScore ← simScore / count
21     return simScore


word, for example “InvitedSpeaker”, therefore, the semantic similarity between 
sentences presented in [21] is adapted as described in Algorithm 1 - line 9. Given 
two sentences sentencel and sentence2, the semantic similarity of each sentence 
with respect to the other is defined by: for each word w; € sentencel, the word 
wj in sentence2 that has the highest path similarity with w; is determined. The 
word similarities are then summed up and normalized with the number of similar 
words between the two sentences. Next, the same procedure is applied to start 
with words in sentence2 to identify the semantic similarity of sentence2 with 
respect to sentencel. Finally, the resulting similarity scores are combined using 
a simple average. Based on the similarity results, the best translation is selected 


and C%_ translated 8 updated. For example, in Fig. 2, the class Ky” in Arabic, 
has a list of English translations such as “President, Head, Chief”. After com- 
puting the similarity between C4 _iranslatea aud Ch, “President” has the highest 
similarityScore of 1.00 with the class “Chairman”, in C}, because they are seman- 
tically equivalent. Therefore, “President” is selected to be the best translation 


for eae ”. The output of this phase is the list of matched terms M between Ch 
and the updated C4 


S—translated* 
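The matching procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the OECM implementation (which is written in Scala): the tiny PARENT taxonomy and the SYNONYM_OF table are hypothetical stand-ins for WordNet, Jaccard similarity is assumed to operate on word sets, and each direction of the sentence similarity is assumed to be averaged over the words of the first sentence.

```python
from itertools import product

# Hypothetical toy taxonomy (child -> parent) standing in for the WordNet hierarchy.
PARENT = {
    "president": "leader",
    "chief": "leader",
    "leader": "person",
    "speaker": "person",
    "person": "entity",
    "award": "entity",
}
# Hypothetical synonym table standing in for WordNet synsets.
SYNONYM_OF = {"chairman": "president", "prize": "award"}


def ancestors(word):
    """Chain from a word (mapped to its synonym representative) up to the root."""
    node = SYNONYM_OF.get(word, word)
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def path_similarity(w1, w2):
    """1 / (1 + shortest path length); 1.0 for synonyms, 0.0 if unconnected."""
    a1, a2 = ancestors(w1.lower()), ancestors(w2.lower())
    common = [node for node in a1 if node in a2]
    if not common:
        return 0.0
    lowest = common[0]  # lowest common ancestor in a tree-shaped taxonomy
    return 1.0 / (1 + a1.index(lowest) + a2.index(lowest))


def jaccard(s1, s2):
    """Jaccard similarity of the two word sets (an assumption about granularity)."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def sentence_similarity(s1, s2):
    """Average, over the words of s1, of the best path similarity to a word of s2."""
    words1, words2 = s1.split(), s2.split()
    if not words1 or not words2:
        return 0.0
    best = [max(path_similarity(w1, w2) for w2 in words2) for w1 in words1]
    return sum(best) / len(best)


def get_similarity(s1, s2):
    """Jaccard first; symmetric semantic score only for non-identical labels."""
    similarity = jaccard(s1, s2)
    if similarity != 1.0:
        similarity = (sentence_similarity(s1, s2) + sentence_similarity(s2, s1)) / 2
    return similarity


def match(translations, target_labels, theta=0.9):
    """Return (source class, best translation, target label, score) if score >= theta."""
    matched = []
    for cls, candidates in translations.items():
        best = max(
            ((get_similarity(t, ct), t, ct) for t, ct in product(candidates, target_labels)),
            default=None,
        )
        if best is not None and best[0] >= theta:
            score, translation, target = best
            matched.append((cls, translation, target, score))
    return matched


# "ar:Chair" is a hypothetical identifier for the Arabic source class.
print(match({"ar:Chair": ["President", "Head", "Chief"]}, ["Chairman", "Person"]))
```

With a real lexical database, path_similarity would be replaced by a WordNet lookup; the control flow (cheap Jaccard check first, semantic score only when needed) stays the same.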
Algorithm 2. Triple Retrieval

Data: S, C_S-translated, C_T, M
Result: T_e triples to be enriched
S_translated ← translateOntologyClasses(S, C_S-translated)
newClasses ← M
while !newClasses.isEmpty() do
    tempTriples ← getTriplesForNewClasses(S_translated, newClasses)
    newClasses ← getClasses(tempTriples).subtract(newClasses)
    newTriples ← newTriples.union(tempTriples)
otherLangTriples ← getOtherLangTriples(newTriples, C_S-translated)
T_e ← newTriples.union(otherLangTriples)


3.4 Triple Retrieval 


The aim of this phase is to identify which new information can be added to T,
and where. Each class in S is replaced by its best translation found in
C_S-translated in the previous phase in order to obtain a translated ontology
S_translated (see Algorithm 2). We design an iterative process to obtain T_e, a
set of triples of the form (s, p, o) that holds all multilingual information
from S that can be added to T. We initiate the iterative process with all
matched terms (newClasses = M) in order to get all related classes, if any
exist. The iterative process has three steps: (1) for each class
c ∈ newClasses, all triples tempTriples in which c is a subject or an object
are retrieved from S_translated, (2) a list of new classes is obtained from
tempTriples, and (3) tempTriples is added to newTriples, which will be added to
T. These three steps are repeated until no new classes can be found
(newClasses.isEmpty() = true). Next, we retrieve all available information from
the other language for each class in newTriples, such as
(president, rdfs:label, "رئيس"@ar). The output of this phase is T_e, which
contains all multilingual triples (i.e., in the languages L_S and L_T) to be
added to T.
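The iterative retrieval can be sketched as a breadth-first expansion over the translated triples. The sketch below is an illustration under assumptions rather than the authors' code: triples are plain Python tuples, the NON_CLASS_TERMS set is a hypothetical filter that keeps vocabulary terms such as Class out of the frontier, and a seen set is added so that the loop is guaranteed to terminate. The sample triples are taken from the use case in Sect. 4.

```python
# Hypothetical filter: terms that may appear as objects but are not ontology classes.
NON_CLASS_TERMS = {"Class"}


def retrieve_triples(triples, matched_classes):
    """Collect all triples reachable from the matched classes via subject/object links."""
    seen = set(matched_classes)
    frontier = set(matched_classes)
    collected = set()
    while frontier:
        # Step 1: triples in which a frontier class is the subject or the object.
        temp = {(s, p, o) for (s, p, o) in triples if s in frontier or o in frontier}
        # Step 3: accumulate them.
        collected |= temp
        # Step 2: classes mentioned in these triples that were not visited before.
        mentioned = {s for (s, _, _) in temp} | {o for (_, _, o) in temp}
        frontier = mentioned - seen - NON_CLASS_TERMS
        seen |= frontier
    return collected


def with_source_labels(collected, source_labels):
    """Add the source-language label of every class that is a subject in the result."""
    subjects = {s for (s, _, _) in collected}
    labels = {(c, "rdfs:label", source_labels[c]) for c in subjects if c in source_labels}
    return collected | labels


source = [
    ("conference contributor", "rdfs:subClassOf", "person"),
    ("committee member", "rdfs:subClassOf", "person"),
    ("committee member", "rdf:type", "Class"),
    ("chairman", "rdfs:subClassOf", "committee member"),
    ("conference contributor", "rdf:type", "Class"),
    ("invited speaker", "rdfs:subClassOf", "conference contributor"),
    ("regular author", "rdfs:subClassOf", "conference contributor"),
    ("topic", "rdf:type", "Class"),  # unrelated to "person", so it is never reached
]

result = retrieve_triples(source, {"person"})
enriched = with_source_labels(
    result, {"committee member": '"Angehörige des Ausschusses"@de'}
)
print(len(result), len(enriched))
```

Starting from the matched class person, the first pass finds the two subclass triples, the second pass expands to chairman, invited speaker, and regular author, and the third pass finds no further classes, so the loop stops.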


3.5 Enrichment 


The aim of this phase is to enrich T using the triples in T_e. By using OECM,
the target ontology can be enriched from several ontologies in different
natural languages sequentially, i.e., one-to-many enrichment. In this case,
T_enriched can have more than two natural languages. For instance, an English T
can be enriched from a German ontology, and the enriched ontology can then be
enriched again from a different Arabic ontology, so that the final T_enriched
is presented in English, German, and Arabic. With the completion of this phase,
we have successfully enriched T and created a multilingual ontology from
monolingual ones.


3.6 Validation 


The aim of this phase is to validate the enriched ontology, which is a crucial
step to detect inconsistencies and syntax errors that might be produced during
the enrichment process [8]. There are two types of validation: syntactic and
semantic. In the syntactic validation, we check that T_enriched conforms to the
W3C RDF standards using the online RDF validation service⁹, which detects
syntax errors such as missing tags. For semantic validation, we use two
reasoners, FaCT++ and HermiT, to detect inconsistencies in T_enriched [8].

### https://w3id.org/seo#Publisher
seo:Publisher rdf:type owl:Class ;
    rdfs:subClassOf <http://xmlns.com/foaf/0.1/Organization> ;
    rdfs:comment "The publisher of the event proceedings."@en ;
    rdfs:label "Publisher"@en ,
               "Herausgeber"@de .                          # new

### http://conference_de#CommitteeMember
conference_de:CommitteeMember rdf:type owl:Class ;         # new
    rdfs:subClassOf <http://xmlns.com/foaf/0.1/Person> ;   # new
    rdfs:label "committee member"@en ,                     # new
               "Angehörige des Ausschusses"@de .           # new

### https://w3id.org/seo#Chair
seo:Chair rdf:type owl:Class ;
    rdfs:subClassOf conference_de:CommitteeMember ;        # new
    rdfs:label "Chair"@en ,
               "Vorsitzender"@de .                         # new

Fig. 3. Small fragment from the SEO_en-de ontology after the enrichment. The
newly added information is marked with "# new" comments.


4 Use Case: Enriching the Scientific Events Ontology 


In this use case, we use an example scenario to enrich the SEO_en¹⁰ ontology
(with 49 classes), in English, using the MultiFarm dataset (see Sect. 5). We
use the Conference ontology (60 classes) and the ConfOf ontology (38 classes),
in German and Arabic respectively, as source ontologies. This use case aims to
show the whole process, from submitting the source and target ontologies to
producing the enriched multilingual ontology. Here, the source ontology is the
German ontology Conference_de and the target ontology is the English ontology
SEO_en. The output is the enriched ontology SEO_en-de, which becomes a
multilingual ontology in English and German. Table 1 demonstrates the
enrichment process for SEO_en from Conference_de and shows a sample of the
output of each phase, from the translation phase to the produced set of triples
which are used to enrich SEO_en.

In the terminological matching task, the relevant matching results are
identified with θ ≥ 0.9 (the pairs with a similarity score of 1.00 in Table 1).
The iterative process, in the triple retrieval phase, is initiated with the
identified matched terms, for example, the person class. At the first
iteration, six triples are produced (not all of them are shown in the table
because of limited space), such as (conference contributor, rdfs:subClassOf,
person), where the matched term person is located at the object position. New
classes are determined from the produced triples, such as conference
contributor and committee member. At the second iteration, all triples that
have these new classes as subject or object are retrieved; for example, for the
committee member class, the triples (committee member, rdf:type, Class) and
(chairman, rdfs:subClassOf, committee member) are retrieved. This process is
repeated and new classes are identified from the produced triples, such as
chairman. The iterative process ended at the fifth iteration, where three
triples are produced without any new classes. The output of this phase is T_e,
which has 40 new triples (with 20 new classes and their German labels) to be
added to SEO_en in order to produce SEO_en-de.

Figure 3 shows a small fragment of the enriched ontology SEO_en-de, in Turtle,
after completing the enrichment process. The resulting multilingual ontology
contains a newly added class CommitteeMember with its English and German
labels, a new relation rdfs:subClassOf between the two classes CommitteeMember
and Chair, and new German labels such as Herausgeber and Vorsitzender for the
classes Publisher and Chair respectively. Similarly, SEO_en-de is enriched from
the Arabic ontology ConfOf_ar, where all classes with English labels in
SEO_en-de are matched with class labels in ConfOf_ar. The produced SEO_en-de-ar
has 113 new triples with 37 new classes with their Arabic labels. Final output
results can be found at the OECM GitHub repository.

Table 1. Use case: the sample output of each phase, from translation to triple
retrieval.

Phase                      Output
-------------------------  ----------------------------------------------------
Translation                (Thema)_de → (subject, topic)_en
                           (Gutachter)_de → (reviewer, expert)_en
                           (Herausgeber)_de → (publisher, editor)_en
                           (Fortschritte der Konferenz)_de → (Progress of the conference)_en

Pre-processing             SizeOrDuration → size duration
                           WorkshopProposals → workshop proposal
                           InvitedSpeaker → invite speaker
                           In-useTrack → use track

Terminological matching    (invited speaker, keynote speaker, 0.57)
score results              (person, person, 1.00)
                           (tutorial, tutorial proposals, 0.78)
                           (prize, award, 1.00)
                           (conference document, license document, 0.61)
                           (publisher, publisher, 1.00)
                           (conference series, event series, 0.79)
                           (conference series, symposium series, 0.75)
                           (proceedings, proceedings, 1.00)
                           (poster, posters track, 0.78)

Triple Retrieval           1st Iteration:
(Iterative process)        (conference contributor, rdfs:subClassOf, person)
                           (committee member, rdfs:subClassOf, person)
                           2nd Iteration:
                           (committee member, rdf:type, Class)
                           (chairman, rdfs:subClassOf, committee member)
                           (conference contributor, rdf:type, Class)
                           (invited speaker, rdfs:subClassOf, conference contributor)
                           (regular author, rdfs:subClassOf, conference contributor)

Triple Retrieval (T_e)     (committee member, rdf:type, Class)
                           (committee member, rdfs:label, "committee member"@en)
                           (committee member, rdfs:label, "Angehörige des Ausschusses"@de)
                           (chairman, rdfs:subClassOf, committee member)

9 https://www.w3.org/RDF/Validator/.
10 https://w3id.org/seo.


5 Evaluation 


The aim of this evaluation is to measure the quality of the cross-lingual
matching process in addition to the enrichment process. We use ontologies from
the MultiFarm benchmark¹¹, a benchmark designed for evaluating cross-lingual
ontology matching systems. MultiFarm consists of seven ontologies (Cmt,
Conference, ConfOf, Edas, Ekaw, Iasted, Sigkdd) originally coming from the
Conference benchmark of OAEI, their translations into nine languages (Chinese,
Czech, Dutch, French, German, Portuguese, Russian, Spanish and Arabic), and the
corresponding cross-lingual alignments between them.


Experimental Setup. All phases of OECM have been implemented using Scala and
Apache Spark¹². The SANSA-RDF library¹³ [17] with the Apache Jena framework¹⁴
is used to parse and manipulate the input ontologies (as RDF triples). In order
to process the class labels, Stanford CoreNLP¹⁵ [19] is used. All experiments
are carried out on the Ubuntu 16.04 LTS operating system with an Intel Core
i7-4600U CPU @ 2.10 GHz × 4 and 10 GB of memory. In our experiments, we
consider English ontologies as target ontologies to be enriched from German and
Arabic ontologies.

Our evaluation has three tasks: (1) evaluating the effectiveness of the
cross-lingual matching process in OECM compared to the reference alignment
provided in the MultiFarm benchmark, (2) comparing OECM matching results with
four state-of-the-art approaches, in addition to our previous work (OECM 1.0)
[14], and (3) evaluating the quality of the enrichment process.


Effectiveness of OECM. In this experiment, we use the English version of the
Cmt ontology as the source ontology, and the German and Arabic versions of the
Conference, ConfOf, and Sigkdd ontologies as target ontologies. We match class
labels in the Cmt ontology with class labels of the German and Arabic versions
of the Conference, ConfOf, and Sigkdd ontologies separately. The resulting
alignments are compared with the reference alignments, as a gold standard,
provided in the benchmark for each pair of ontologies. Table 2 shows the
precision, recall and F-measure of the matching process for each pair of
ontologies. OECM achieves the highest precision of 1.00 for all pairs of
ontologies. Meanwhile, OECM achieves the highest recall and F-measure of 0.90
and 0.95 respectively for matching the German Sigkdd with the English Cmt. As
two authors of this work are native speakers of Arabic, we found some
linguistic mistakes in the Arabic ontologies which negatively affect the
translation and the matching results. Therefore, we corrected these mistakes
and made the corrected ontologies available at the OECM GitHub repository.
Matching results before and after the corrections are presented in the table;
the corrections greatly improve the matching results in terms of recall and
F-measure. For instance, in matching the Arabic Sigkdd with the English Cmt,
recall and F-measure improve by 40 and 32 percentage points respectively (from
0.40 to 0.80 and from 0.57 to 0.89).

Table 2. Precision, recall and F-measure for the cross-lingual matching.

                   German × English                Arabic × English
Ontology pairs     Precision  Recall  F-measure    Precision       Recall          F-measure
                                                   Before  After   Before  After   Before  After
Conference × Cmt   1.00       0.38    0.56         1.00    1.00    0.33    0.42    0.50    0.59
ConfOf × Cmt       1.00       0.70    0.82         1.00    1.00    0.30    0.60    0.46    0.75
Sigkdd × Cmt       1.00       0.90    0.95         1.00    1.00    0.40    0.80    0.57    0.89

11 https://www.irit.fr/recherches/MELODI/multifarm/.
12 https://spark.apache.org/.
13 https://github.com/SANSA-Stack/SANSA-RDF.
14 https://jena.apache.org/.
15 https://stanfordnlp.github.io/CoreNLP/.
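For reference, the three measures can be computed from a system alignment and a reference alignment as sketched below. The correspondences are made-up placeholders (src0/tgt0 and so on), chosen only so that the result has the same shape as the German Sigkdd × Cmt row of Table 2; the actual alignment sizes are not stated in the text.

```python
def evaluate(system, reference):
    """Return (precision, recall, f_measure) for two collections of correspondences."""
    system, reference = set(system), set(reference)
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical data: 10 reference correspondences, 9 of them found, none wrong.
reference = {(f"src{i}", f"tgt{i}") for i in range(10)}
system = set(sorted(reference)[:9])

p, r, f = evaluate(system, reference)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Because every returned correspondence is correct, precision is 1.00 while recall is bounded by the one reference correspondence that was missed; the F-measure is the harmonic mean of the two.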


Comparison with the State-of-the-Art. We identified four of the related
approaches (AML, KEPLER, LogMap, and XMap) to be included in our evaluation in
addition to OECM 1.0. The other related approaches publish neither their code
nor their evaluation datasets [10,11,25]. In order to compare our results with
the state-of-the-art, we use the German (Conference_de) and Arabic
(Conference_ar) versions of the Conference ontology as the source ontologies,
and the Ekaw_en and Edas_en ontologies as the target English ontologies. We
choose the Ekaw_en and Edas_en ontologies in this evaluation because they are
used by the state-of-the-art systems for evaluation, as mentioned in the
results of OAEI 2018. We generate the gold standard alignments between each
pair of ontologies using the Alignment API 4.9¹⁶, as used by the
state-of-the-art systems, in order to compute precision, recall, and F-measure.
Table 3 compares our results against the four state-of-the-art approaches and
OECM 1.0 (results for matching English and German ontologies only). In
addition, we add the updated Arabic ontology (Conference'_ar) with our
linguistic corrections to the matching process in order to show the
effectiveness of such corrections. The current version of OECM (OECM 1.1)
outperforms all other systems in precision, recall and F-measure. For instance,
when matching Conference_de × Ekaw_en, OECM 1.1 outperforms LogMap, which has
the highest precision, recall and F-measure among the other systems, by 29, 60
and 58 percentage points in terms of precision, recall and F-measure
respectively. The use of semantic similarity in OECM 1.1 significantly improves
the matching results compared to the results of OECM 1.0. For instance, when
matching Conference_de × Ekaw_en, the matching results of OECM 1.0 have been
improved by 25, 13, and 18 percentage points in terms of precision, recall and
F-measure respectively. When matching Conference_ar × Edas_en, XMap outperforms
OECM by 14 percentage points in terms of precision, while OECM outperforms it
in both recall and F-measure. The precision of OECM is slightly lower here
because of the linguistic mistakes found in Conference_ar. When Conference'_ar,
which has the linguistic corrections, is used as the source ontology in this
matching, the matching results improve.

Table 3. State-of-the-art comparison results.

                 Conference_de × Ekaw_en           Conference_de × Edas_en
Approaches       Precision  Recall  F-measure      Precision  Recall  F-measure
AML [7]          0.56       0.20    0.29           0.86       0.35    0.50
KEPLER [16]      0.33       0.16    0.22           0.43       0.18    0.25
LogMap [15]      0.71       0.20    0.31           0.71       0.29    0.42
XMap [28]        0.18       0.16    0.17           0.23       0.18    0.20
OECM 1.0 [14]    0.75       0.67    0.71           0.93       0.76    0.84
OECM 1.1         1.00       0.80    0.89           1.00       0.78    0.88

                 Conference_ar × Ekaw_en           Conference_ar × Edas_en
AML [7]          0.64       0.39    0.28           0.71       0.42    0.29
KEPLER [16]      0.40       0.30    0.24           0.40       0.30    0.24
LogMap [15]      0.40       0.13    0.08           0.40       0.18    0.12
XMap [28]        1.00       0.0     0.0            1.00       0.00    0.00
OECM 1.1         1.00       0.50    0.67           0.86       0.67    0.75

                 Conference'_ar × Ekaw_en          Conference'_ar × Edas_en
OECM 1.1         0.88       0.70    0.78           1.00       0.78    0.88

16 http://alignapi.gforge.inria.fr/.


Evaluating the Enrichment Process. According to [4], the enriched ontology can
be evaluated by comparing it against a predefined reference ontology (gold
standard). In this experiment, we evaluate the enriched ontology SEO_en-de
(cf. Sect. 4). A gold standard ontology has been manually created by ontology
experts. Compared with the gold standard, OECM achieves 1.00, 0.80, and 0.89 in
terms of precision, recall, and F-measure respectively. This finding confirms
the usefulness of our approach in cross-lingual ontology enrichment.


6 Conclusion 


We present a fully automated approach, OECM, for building multilingual
ontologies. The strength of our contribution lies in building such ontologies
from monolingual ones using cross-lingual matching between ontology concepts.
Resources in Indo-European and non-Indo-European languages are used for
enrichment in order to illustrate the robustness of our approach. Considering
multiple translations of concepts and using semantic similarity measures to
select the best translation have significantly improved the quality of the
matching process. An iterative triple retrieval process has been developed to
determine which information from the source ontology can be added to the target
ontology, and where such information should be added. We show the applicability
of OECM by presenting a use case for enriching an ontology in the scholarly
communication domain. The results of the cross-lingual matching process are
promising compared to five state-of-the-art approaches, including the previous
version of OECM. Furthermore, evaluating the quality of the enrichment process
emphasizes the validity of our approach. Finally, we propose some linguistic
corrections for the Arabic ontologies in the MultiFarm benchmark that are used
in our experiments, which considerably enhanced the matching results. In
conclusion, our approach provides a springboard for a new way to build
multilingual ontologies from monolingual ones. In the future, we intend to
further consider properties and individuals in the enrichment process. In
addition, we aim to apply optimization methods in order to evaluate the
efficiency of OECM when enriching very large ontologies.
Abstract. In this paper, we present the Ontology Matching EvaLuation
Toolkit (MELT), a software toolkit to facilitate ontology matcher devel- 
opment, configuration, evaluation, and packaging. Compared to existing 
tools in the ontology matching domain, our framework offers detailed 
evaluation capabilities on the correspondence level of alignments as well 
as extensive group evaluation possibilities. A particular focus is put on a 
streamlined development and evaluation process along with ease of use 
for matcher developers and evaluators. Our contributions are twofold: 
We present an open source matching toolkit that integrates well into 
existing platforms, as well as an exemplary analysis of two OAEI 2018 
tracks demonstrating advantages and analytical capabilities of MELT. 
1 Introduction


Ontology matching or ontology alignment is the non-trivial task of finding
correspondences between entities of a set of given ontologies [10]. The
matching can be performed manually or through the use of an automated matching
system. For systematically evaluating the quality of such matchers, the
Ontology Alignment Evaluation Initiative (OAEI) has been running campaigns [9]
every year since 2005. Unlike other evaluation campaigns where researchers
submit data sets as solutions to report their results (such as Kaggle¹), the
OAEI requires participants to submit a matching system, which is then executed
on-site. After the evaluation, the results are publicly reported². Therefore,
execution and evaluation platforms have been developed and OAEI participants
are required to package and submit their matching system for the corresponding
platform. Two well-known platforms are used in the ontology matching community:
the Semantic Evaluation at Large Scale (SEALS)³ [12,35] and the more recent
Holistic Benchmarking of Big Linked Data (HOBBIT)⁴ [24].


1 https://www.kaggle.com.
2 http://oaei.ontologymatching.org/2018/results/index.html.
3 http://www.seals-project.eu.
4 http://project-hobbit.eu.
Based on the results of the OAEI 2018 campaign [1], only 4 out of 12 tracks
were available in HOBBIT (LargeBio, Link Discovery, SPIMBENCH,
KnowledgeGraph). Out of 19 matchers that were submitted in the 2018 campaign,
only 6 matchers supported both SEALS and HOBBIT, and 2 supported HOBBIT
exclusively. The remaining 11 matchers supported only SEALS. While one reason
for the low HOBBIT adoption might be its novelty, it also requires more steps
to package a matcher for the HOBBIT platform as well as knowledge of the
Docker⁵ virtualization software. In particular for new entrants to the ontology
matching community, the existing tooling might appear overwhelmingly
complicated. In addition to potential obstacles for matcher development and
submission, another observation from the OAEI campaigns is that the evaluation
varies greatly among the different tracks that are offered; e.g., the Anatomy
results contain Recall+ as well as alignment coherence, whereas the Conference
track focuses on different reference alignments. Due to limited group
evaluation capabilities in existing frameworks, some track organizers even
developed their own evaluation systems.

For these reasons we present the Matching EvaLuation Toolkit (MELT)⁶, an open
source toolkit for ontology matcher development, fine-tuning, submission, and
evaluation. The target audience are matching system developers as well as
researchers who run evaluations on multiple matching systems, such as OAEI
track organizers. Likewise, system developers can use this tool to analyze the
performance and errors of their systems in order to improve them. Furthermore,
they can package and submit the system easily to OAEI campaigns.

The rest of this paper is structured as follows: Sect. 2 describes other work in 
the field of alignment visualization and evaluation. Section 3 gives an overview 
of the MELT framework and its possibilities whereas Sect. 4 shows an exemplary 
analysis of the latest systems submitted to the OAEI. We finish with an outlook 
on future developments. 


2 Related Work 


As MELT can be used both for evaluating ontology matching tools, as well as 
visualizing matching results, we discuss related works in both fields. 


2.1 Matching and Alignment Evaluation Platforms 


OAEI campaigns consist of multiple problem sets, so-called tracks. Each track
has its organizers who provide the datasets including reference alignments,
execute the matching systems, and prepare the results page for the participants
and the whole community. A track contains one or more test cases, each of which
corresponds to a specific matching task consisting of two ontologies and a
reference alignment. In 2010, three tracks (Benchmark, Anatomy, and Conference)
were adjusted to be run with the SEALS platform [8]. One year later,
participants of OAEI campaigns had to implement a matching interface and the
SEALS client was the main tool used for executing and evaluating matchers. The
interface contains a simple method (align()) which receives a URL for the
source and a URL for the target ontology and has to return a URL which points
to a file containing all correspondences in the alignment format⁷. This format
is defined and used by the Alignment API [5].

5 https://www.docker.com.
6 https://github.com/dwslab/melt.

Starting from 2017, a second evaluation platform, called HOBBIT, was added
[18]. One difference compared to SEALS is that the system has to be submitted
as a Docker image to a GitLab instance⁸, and in the corresponding project, a
matcher description file has to be created. After submission of the matching
system, the whole evaluation runs on servers of the HOBBIT platform. Thus, the
source code for evaluating the matchers has to be submitted as a Docker image
as well. All Docker containers communicate with each other over a message
broker (RabbitMQ⁹). Hence, the interface between a system and the evaluation
component can be arbitrary. To keep the interface similar to SEALS, the data
generation component transfers two ontologies and the system adapter receives
the URLs of these files. It should return a file similar to the SEALS
interface.

Working with alignments in Java code can be achieved with the Alignment API
[5]. It is the most well-known API for ontology matching and can be used for
loading and persisting alignments as well as for evaluating them with a set of
possible evaluation strategies. Moreover, it provides some matching systems
which are also used in OAEI campaigns as a baseline. Unfortunately, it cannot
yet be used with the Maven build system¹⁰. Therefore, instead of using this
API, some system developers created their own classes to work with alignments
and to store them on disk¹¹ in order to be compatible with the evaluation
interface.

Alignment Visualization. A lot of work has been done in the area of analyzing,
editing, and visualizing alignments or ontologies with a graphical user
interface. One example is Alignment Cubes [15], which allows an interactive
visual exploration and evaluation of alignments. An advantage is the
fine-grained analysis on the level of an individual correspondence. It further
allows visualizing the performance history of a matcher, for instance, which
correspondences a matcher found in the most recent OAEI campaign but not in the
previous one. Another framework for working with alignment files is VOAR
[28,29]. It is a Web-based system where users can upload ontologies and
alignments. VOAR then allows the user to render them with multiple
visualization types. The upload size of ontologies as well as alignments is
restricted, so very large files cannot be uploaded.


7 http://alignapi.gforge.inria.fr/format.html.
8 https://master.project-hobbit.eu.
9 https://www.rabbitmq.com.
10 https://maven.apache.org/.
11 https://github.com/ernestojimenezruiz/logmap-matcher/tree/master/src/main/java/uk/ac/ox/krr/logmap_lite/io.
Similar to VOAR, the SILK workbench [33] is also a Web-based tool with a focus
on link/correspondence creation between different data sets in the Linked Open
Data Cloud¹². Unlike VOAR, it usually runs on the user’s computer. Matching
operations (such as Levenshtein distance [20]) are visualized as nodes in a
computation graph. The found correspondences are displayed and can be modified
to further specify which concepts should be matched.

Further visualization approaches were pursued by matching system developers in
order to fine-tune their systems. All these visualizations are therefore very
specific to a particular matching approach. One such example is YAM++ [23],
which is a matching system based on a machine learning approach. Results are
visualized in a split view where the class hierarchy of each of the two input
ontologies is shown on one side, and lines are drawn between the matched
classes. The user can modify the alignment with the help of this GUI. In a
similar way, the developers of COMA++ [2] created a user interface for their
results. A visualization of whole ontologies is not implemented by the current
tools but can be achieved with the help of VOWL [21] or Web Protégé [32], for
instance.

Our proposed framework MELT allows for detailed and reusable analyses such as
the ones presented in this section due to its flexible metrics and evaluators.
An overview of the framework is presented in the following section.


3 Matching Evaluation Toolkit 


MELT is a software framework implemented in Java which aims to facilitate
matcher development, configuration, packaging, and evaluation. In this section,
we will first introduce Yet Another Alignment API, an API for ontology
alignment which is integrated into the framework. Afterwards, the matcher
development process in MELT is introduced. Subsections 3.3 and 3.4 cover
specific aspects of the framework that have not yet been explicitly addressed
in the community: the implementation of matchers outside of the Java
programming language (Subsect. 3.3) and the chaining of matching workflows
(Subsect. 3.4). After explaining the tuning component of the framework, this
section closes with the matcher evaluation process in MELT.


3.1 YAAA: Yet Another Alignment API 


To allow for a simple development workflow, MELT contains Yet Another Alignment
API (YAAA). It is similar to the Alignment API presented earlier but contains
additional improvements such as Maven support and arbitrary indexing
possibilities of correspondence elements, allowing queries such as “retrieve
all correspondences with a specific source”. This is very helpful for a fast
evaluation of large-scale test cases containing large reference or system
alignments. The indexing is done with the cqengine library¹³. The API is, in
addition, capable of serializing and parsing alignments. It also makes sure
that all characters are escaped and that the resulting XML is actually
parseable¹⁴. As explainability is still an open issue in the ontology matching
community [7,34], YAAA also allows for extensions to correspondences and
alignments. This means that additional information such as debugging
information or human-readable explanations can be added. If there is additional
information available in the alignment, it will also be printed by the default
CSVEvaluator, which allows for immediate consumption in the analysis and
evaluation process and hopefully fosters the usage of additional explanations
in the alignment format.

12 https://lod-cloud.net.
13 https://github.com/npgall/cqengine/.

It is important to note that MELT does not require the usage of YAAA for
parameter tuning, executing, or packaging a matcher; it also works with other
APIs such as the Alignment API. This allows evaluating matchers that were not
developed using YAAA (see Sect. 4).
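The benefit of indexing correspondence elements can be illustrated with a small Python sketch. This is not the YAAA API (YAAA is a Java library that relies on cqengine); the class names Correspondence and IndexedAlignment, their fields, and the example URIs are hypothetical and only show why a query such as "all correspondences with a specific source" becomes a dictionary lookup instead of a scan over the whole alignment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """One alignment cell: source entity, target entity, relation, confidence."""
    source: str
    target: str
    relation: str = "="
    confidence: float = 1.0


class IndexedAlignment:
    """A set of correspondences with hash indexes on source and target."""

    def __init__(self, correspondences=()):
        self._all = set()
        self._by_source = defaultdict(set)
        self._by_target = defaultdict(set)
        for correspondence in correspondences:
            self.add(correspondence)

    def add(self, correspondence):
        # Keep both indexes in sync with the main set.
        if correspondence not in self._all:
            self._all.add(correspondence)
            self._by_source[correspondence.source].add(correspondence)
            self._by_target[correspondence.target].add(correspondence)

    def with_source(self, uri):
        """All correspondences whose source is the given URI (no full scan)."""
        return set(self._by_source.get(uri, ()))

    def with_target(self, uri):
        """All correspondences whose target is the given URI (no full scan)."""
        return set(self._by_target.get(uri, ()))

    def __len__(self):
        return len(self._all)


alignment = IndexedAlignment([
    Correspondence("http://a.example/onto#Person", "http://b.example/onto#Human", "=", 0.9),
    Correspondence("http://a.example/onto#Person", "http://b.example/onto#Agent", "<", 0.6),
    Correspondence("http://a.example/onto#Paper", "http://b.example/onto#Article"),
])

person_matches = alignment.with_source("http://a.example/onto#Person")
for c in sorted(person_matches, key=lambda c: c.target):
    print(c.target, c.relation, c.confidence)
```

When a system alignment is compared against a large reference alignment, such lookups are issued once per correspondence, which is where the index pays off.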


3.2 Matcher Development Workflow 


In order to develop a matcher in Java with MELT, the first step is to decide 
which matching interface to implement. The most general interface is encap- 
sulated in class MatcherURL which receives two URLs of the ontologies to be 
matched together with a URL referencing an input alignment. The return value 
should be a URL representing a file with correspondences in the alignment for- 
mat. Since this interface is not very convenient, we also provide more specialized 
classes. In the matching-yaaa package we set the alignment library to YAAA. 
All matchers implementing interfaces from this package have to use the library 
and get at the same time an easier to handle interface of correspondences. In 
further specializations we also set the Semantic Web framework which is used to 
represent the ontologies. For better usability, the two most well-known
frameworks are integrated into MELT: Apache Jena¹⁵ [3] (MatcherYAAAJena) and the
OWL API¹⁶ [14] (MatcherYAAAOwlApi). As the latter two classes are organized
as separate Maven projects, only the libraries which are actually required for the
matcher are loaded. In addition, further services were implemented such as an 
ontology cache which ensures that ontologies are parsed only once. This is help- 
ful, for instance, when the matcher accesses an ontology multiple times, when 
multiple matchers work together in a pipeline, or when multiple matchers shall 
be evaluated. We explicitly chose a framework-independent architecture so that 
developers can use the full functionality of the frameworks they already know 
rather than having to understand an additional wrapping layer. The different 
levels at which a matcher can be developed, as well as how the classes presented
in this section work together, are displayed in Fig. 1.


14 This is not always the case for other implementations. 
15 https://jena.apache.org.
16 http://owlcs.github.io/owlapi/.

Fig. 1. Different possibilities to implement matchers


3.3 External Matching 


The current ontology matching development and evaluation frameworks that 
are available focus on the Java programming language. As researchers apply 
advances in machine learning and natural language processing to other domains, 
they often turn to Python because leading machine learning libraries such as 
scikit-learn¹⁷, TensorFlow¹⁸, PyTorch¹⁹, Keras²⁰, or gensim [26] are not easily
available for the Java language. In the 2018 OAEI campaign, the first tools using
such frameworks for ontology matching were submitted [1].

To accommodate these changes, MELT allows developers to write a matcher
in any other programming language and wrap it as a SEALS or HOBBIT package.
To this end, class MatcherExternal has to be extended. It has to transform
the given ontology URIs and input alignments into an executable command line
call. The interface for the external process is simple: it receives the input
variables via the command line and outputs the results via the standard output
of the process, similar to many Unix command line tools. An example of a


17 https://scikit-learn.org/. 

18 https://www.tensorflow.org/.
19 https://pytorch.org/. 

20 https://keras.io/. 
matcher implemented in Python is available on GitHub.²¹ It also contains a
simple implementation of the alignment format to allow Python matchers to
serialize their correspondences.
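
The command line contract described above can be illustrated with a minimal Python sketch. It is not the example from the MELT repository: the argument order, the function names, the namespace constants, and the reduced set of XML elements are assumptions made for illustration only.

```python
import sys
from xml.sax.saxutils import escape, quoteattr

# Namespaces assumed for the alignment format; verify against the format specification.
ALIGN_NS = "http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def serialize_alignment(correspondences):
    """Render (source, target, relation, confidence) tuples as simplified alignment XML."""
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        '<rdf:RDF xmlns="%s" xmlns:rdf="%s">' % (ALIGN_NS, RDF_NS),
        "<Alignment>",
    ]
    for source, target, relation, confidence in correspondences:
        lines += [
            "  <map><Cell>",
            # quoteattr escapes &, < and > and adds the surrounding quotes.
            "    <entity1 rdf:resource=%s/>" % quoteattr(source),
            "    <entity2 rdf:resource=%s/>" % quoteattr(target),
            "    <relation>%s</relation>" % escape(relation),
            "    <measure>%.2f</measure>" % confidence,
            "  </Cell></map>",
        ]
    lines += ["</Alignment>", "</rdf:RDF>"]
    return "\n".join(lines)


def match(source_url, target_url):
    """Hypothetical placeholder for the actual matching logic."""
    return []


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Inputs arrive via the command line, the result leaves via standard output.
    print(serialize_alignment(match(sys.argv[1], sys.argv[2])))
```

Escaping the URIs before writing them is what keeps the output parseable when an entity URI contains characters such as `&`.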

When executing the matcher with the SEALS client, the matching system 
is loaded into the Java virtual machine (JVM) of the SEALS client (evaluation 
code) with a customized class loader. This raises two points: (1) The code under
test is executed in the same JVM and can potentially access the evaluation
code. (2) The class loader used, which comes from the JCL library²², does not
implement all methods of a class loader (specifically getPackage() and
getResource()). However, these methods are used by other Java libraries²³ to
load operating system dependent files contained in the jar file. Thus, some
libraries do not work when evaluating a matcher with SEALS. Another problem is
that libraries used by the matching system may collide with libraries used by
SEALS because both run in the same JVM instance. This can cause issues with Jena
and other Semantic Web frameworks. To solve this issue, MatcherExternal can be
used not only for matchers written in another programming language but also for
Java matchers which use dependencies that are incompatible with the SEALS platform.


3.4 Pipelining Matchers 


Ontology matchers often combine multiple matching approaches, and different
matchers sometimes share the same parts. An example would be a string-based matching of
elements, and the application of a stable marriage algorithm or another matching 
refinement step on the resulting similarity matrix. 

Following this observation, MELT allows for the chaining of matchers: The 
alignment of one matcher is then the input for the next matcher in the pipeline. 
The ontology caching services of MELT mentioned above prevent performance 
problems arising from repetitive loading and parsing of ontologies. 

In order to execute a matcher pipeline, classes MatcherPipelineYAAA (for
matchers that use different ontology management frameworks),
MatcherPipelineYAAAJena (for pure Jena pipelines), and MatcherPipelineYAAAOwlApi
(for pure OWL API pipelines) can be extended. Here the initializeMatchers()
method has to be implemented. It returns matcher instances as a List in the
order in which they shall be executed. These reusable parts of a matcher can
easily be uploaded to GitHub to allow other developers to use common
functionality.²⁴
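
The chaining idea can be sketched independently of MELT's Java classes. In the following Python sketch, every matcher is a plain function that receives both ontologies and the alignment produced so far; the function names and the dictionary-based ontology representation are hypothetical, and a simple greedy one-to-one filter stands in for a stable marriage algorithm.

```python
def label_matcher(source, target, alignment):
    """Add a correspondence for each pair of entities with equal normalized labels."""
    result = dict(alignment)
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if s_label.strip().lower() == t_label.strip().lower():
                result[(s_uri, t_uri)] = 1.0
    return result


def one_to_one_filter(source, target, alignment):
    """Greedy refinement: keep the highest-scored correspondence per entity."""
    kept, used_sources, used_targets = {}, set(), set()
    for (s_uri, t_uri), score in sorted(alignment.items(), key=lambda item: -item[1]):
        if s_uri not in used_sources and t_uri not in used_targets:
            kept[(s_uri, t_uri)] = score
            used_sources.add(s_uri)
            used_targets.add(t_uri)
    return kept


def run_pipeline(matchers, source, target, input_alignment=None):
    """Feed the alignment of each matcher into the next one, in list order."""
    alignment = dict(input_alignment or {})
    for matcher in matchers:
        alignment = matcher(source, target, alignment)
    return alignment
```

Because both steps share one signature, the refinement step can be reused behind any other matcher without modification.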


21 https://github.com/dwslab/melt/tree/master/examples/externalPythonMatcher.

22 https://github.com/kamranzafar/JCL/blob/master/JCL/src/xeus/jcl/AbstractClassLoader.java.

23 An example would be class SQLiteJDBCLoader in sqlite-jdbc which uses these class
loader methods.

24 Other GitHub dependencies can be included by using https://jitpack.io, for instance.
3.5 Tuning Matchers


Many ontology matching systems require parameters to be set at design time. 
Those can significantly influence the matching system’s performance. An exam- 
ple for a parameter would be the threshold parameter of a matcher utilizing 
a normalized string distance metric. For tuning such a system, MELT offers a 
GridSearch functionality. It requires a matcher and one or more parameters 
together with their corresponding search spaces, i.e. the values that shall be 
tested. The Cartesian product of these values is computed and each system 
configuration (an element of the Cartesian product which is a tuple of values) 
runs on the specified test case. The result is an ExecutionResultSet which can 
be further processed like any other result of matchers in MELT. To speed up 
the execution, class Executor was extended and can run matchers in parallel. 
Properties can be specified by a simple string. For this purpose, the JavaBeans
specification²⁵ is used to access the properties via so-called setter methods.
This strategy also allows changing properties of nested classes or of any list
or map. An example of matcher tuning can be found in the MELT repository.²⁶
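
The Cartesian product step can be made concrete with a short Python sketch. It is an illustration of the idea rather than MELT's GridSearch class: `ThresholdMatcher`, `grid_search`, and the `evaluate` callback are hypothetical names, and `setattr` plays the role that setter methods play in the Java implementation.

```python
from itertools import product


class ThresholdMatcher:
    """Hypothetical matcher with two tunable properties."""

    def __init__(self):
        self.threshold = 0.5
        self.lowercase = True


def grid_search(matcher_factory, search_spaces, evaluate):
    """Evaluate one fresh matcher per element of the Cartesian product."""
    names = list(search_spaces)
    results = []
    for values in product(*(search_spaces[name] for name in names)):
        matcher = matcher_factory()
        # Properties are addressed by their name given as a string.
        for name, value in zip(names, values):
            setattr(matcher, name, value)
        results.append((dict(zip(names, values)), evaluate(matcher)))
    return results
```

With three threshold values and two values for `lowercase`, six configurations are evaluated; the number of runs grows multiplicatively with every added parameter, which is why running configurations in parallel pays off.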


3.6 Evaluation Workflow 


MELT defines a workflow for matcher execution and evaluation. To this end, it
uses the vocabulary of the OAEI: A matcher can be evaluated on a
TestCase, i.e. a single ontology matching task. One or more test cases are
grouped in a Track. MELT contains a built-in TrackRepository which allows
access to all OAEI tracks and test cases at design time without actually
downloading them from the OAEI Web page. At runtime TrackRepository checks
whether the required ontologies and alignments are available in the internal
buffer; if data is missing, it automatically downloads and caches it for the
next access. The caching mechanism is an advantage over the SEALS platform,
which downloads all ontologies again at runtime and thereby slows down the
evaluation process if it is run multiple times in a row.

One or more matchers are given, together with the track or test case on 
which they shall be run, to an Executor. The Executor runs a matcher or a 
list of matchers on a single test case, a list of test cases, or a track. The run() 
method of the executor returns an ExecutionResultSet. The latter is a set of 
ExecutionResult instances which represent individual matching results on a 
particular test case. Lastly, an Evaluator accepts an ExecutionResultSet and 
performs an evaluation. To do so, it may use one or more Metric objects. MELT
contains various metrics, such as a ConfusionMatrixMetric, and evaluators. 
Nonetheless, the framework is designed to allow for the further implementation 
of evaluators and metrics. 
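
What a confusion matrix metric computes for a single execution result can be sketched as follows. The Python function below is an illustration, not MELT's ConfusionMatrixMetric: it assumes that an alignment is simply a set of (source, target) pairs and that the relation type is ignored.

```python
def confusion_matrix(system, reference):
    """Compare a system alignment with a reference alignment."""
    system, reference = set(system), set(reference)
    true_positives = system & reference
    precision = len(true_positives) / len(system) if system else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "tp": len(true_positives),
        "fp": len(system - reference),   # found but not in the reference
        "fn": len(reference - system),   # in the reference but not found
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```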


25 https://www.oracle.com/technetwork/java/javase/documentation/spec-136004.html.

26 https://github.com/dwslab/melt/blob/master/examples/simpleJavaMatcher/src/test/java/de/uni_mannheim/informatik/dws/ontmatching/demomatcher/EvaluateMatcher.java.
After the Executor has run, an ExecutionResult can be refined by a
Refiner. A refiner takes an individual ExecutionResult and makes it smaller.
An example is the TypeRefiner which creates additional execution results 
depending on the type of the alignment (classes, properties, datatype properties, 
object properties, instances). Another example for an implemented refiner is the 
ResidualRefiner which only keeps non-trivial correspondences and can be used 
for metrics such as recall+. Refiners can be combined. This means that MELT 
can calculate very specific evaluation statistics such as the residual precision of 
datatype property correspondences. 
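
How combined refiners lead to a statistic such as the residual precision of datatype property correspondences can be sketched in Python. The functions below are hypothetical stand-ins for the TypeRefiner and ResidualRefiner; the sketch assumes that both the system and the reference alignment are refined in the same way and that the set of trivial correspondences is given.

```python
def type_refine(alignment, entity_types, wanted_type):
    """Keep only correspondences whose source entity has the wanted type."""
    return {(s, t) for (s, t) in alignment if entity_types.get(s) == wanted_type}


def residual_refine(alignment, trivial):
    """Drop correspondences that a trivial baseline matcher already finds."""
    return set(alignment) - set(trivial)


def precision(system, reference):
    return len(system & reference) / len(system) if system else 0.0


def residual_datatype_precision(system, reference, entity_types, trivial):
    """Combine both refiners before computing precision."""
    def refine(alignment):
        typed = type_refine(alignment, entity_types, "datatype_property")
        return residual_refine(typed, trivial)

    return precision(refine(system), refine(reference))
```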

A novelty of this framework is also the granularity at which alignments can 
be analyzed: The EvaluatorCSV writes every correspondence in a CSV format 
together with further details about the matched resources and the performed 
refinements. This allows for an in-depth analysis in spreadsheet applications
such as LibreOffice Calc, where filters can be used to perform analytical
queries such as “false-positive datatype property matches by
matcher X on test case Y”.
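
The same kind of filter query can also be run programmatically. The Python sketch below uses only the standard csv module; the column names and the three rows in `SAMPLE` are made-up toy values for illustration and do not reflect the actual column layout written by the EvaluatorCSV.

```python
import csv
import io

# Toy rows with hypothetical column names, not real evaluation output.
SAMPLE = """matcher,test_case,source,target,type,result
matcherX,testCaseY,s:name,t:title,datatype property,false positive
matcherX,testCaseY,s:Person,t:Human,class,true positive
matcherZ,testCaseY,s:age,t:size,datatype property,false positive
"""


def filter_rows(csv_text, **conditions):
    """Return all rows whose columns equal the given values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all(row[column] == value for column, value in conditions.items())
    ]
```

A call such as `filter_rows(SAMPLE, matcher="matcherX", type="datatype property", result="false positive")` corresponds to the spreadsheet filter quoted above.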


4 Exemplary Analysis of OAEI 2018 Results 


In order to demonstrate the capabilities of MELT, a small analysis of the OAEI
2018 results for the Conference and Anatomy tracks has been performed and is
presented in the following.

The Conference track consists of 16 ontologies from the conference domain. 
We evaluated all matching systems that participated in the 2018 campaign: 
ALIN [30], ALOD2Vec [25], AML [11], DOME [13], FCAMapX [4], Holontology 
[27], KEPLER [19], Lily [31], LogMap and LogMapLt [17], SANOM [22], as well 
as XMap [6]. 

The Anatomy track consists of a mapping between the human anatomy and 
the anatomy of a mouse. In the 2018 campaign, the same matchers mentioned 
above participated with the addition of LogMapBio, a matcher from the LogMap 
family [17]. 

First, the resulting alignments for Anatomy²⁷ and Conference²⁸ have been
downloaded from the OAEI Web site. As both result sets follow the same
structure every year, the MELT functions
Executor.loadFromAnatomyResultsFolder() and
Executor.loadFromConferenceResultsFolder() were used to load
the results. The resulting ExecutionResultSet was then handed over to the
MatcherSimilarityMetric and rendered using the
MatcherSimilarityLatexHeatMapWriter. As the Conference track consists of
multiple test cases, the
results have to be averaged. Here, out of the available calculation modes in 
MELT, micro-average was chosen as this calculation mode is also used on the 


27 http://oaei.ontologymatching.org/2018/results/anatomy/oaei2018-anatomy-alignments.zip.

28 http://oaei.ontologymatching.org/2018/conference/data/conference2018-results.zip.
official results page²⁹ to calculate precision and recall scores. Altogether, the
analysis was performed with a few lines of Java code.³⁰
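
The difference between the averaging modes can be illustrated with a small Python sketch. It assumes that each test case is summarized by the number of true positives, the size of the system alignment, and the size of the reference alignment; the function names are hypothetical and not part of MELT.

```python
def micro_average(test_cases):
    """Sum the counts over all test cases, then compute precision and recall once."""
    true_positives = sum(tp for tp, _, _ in test_cases)
    system_total = sum(size for _, size, _ in test_cases)
    reference_total = sum(size for _, _, size in test_cases)
    precision = true_positives / system_total if system_total else 0.0
    recall = true_positives / reference_total if reference_total else 0.0
    return precision, recall


def macro_average(test_cases):
    """Compute precision and recall per test case, then average the scores."""
    precisions = [tp / s if s else 0.0 for tp, s, _ in test_cases]
    recalls = [tp / r if r else 0.0 for tp, _, r in test_cases]
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```

In the micro-average, test cases with many correspondences weigh more; in the macro-average, every test case contributes equally regardless of its size.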

Tables 1 and 2 show the Jaccard overlap [16] of the correspondences rendered
as heat maps where darker colors indicate a higher similarity. The Jaccard
coefficient $J \in [0,1]$ between two alignments $a_1$ and $a_2$ with
correspondences $corr(a_1)$ and $corr(a_2)$ was obtained as follows:

$$J(a_1, a_2) = \frac{|corr(a_1) \cap corr(a_2)|}{|corr(a_1) \cup corr(a_2)|}$$
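
The Jaccard coefficient between alignments can be sketched in a few lines of Python. The sketch treats an alignment as a set of hashable correspondences; returning 1.0 for two empty alignments is a convention assumed here, and the function names are not MELT identifiers.

```python
def jaccard(corr_a, corr_b):
    """Jaccard coefficient of two sets of correspondences."""
    corr_a, corr_b = set(corr_a), set(corr_b)
    union = corr_a | corr_b
    if not union:
        return 1.0  # assumed convention for two empty alignments
    return len(corr_a & corr_b) / len(union)


def similarity_matrix(alignments):
    """Pairwise Jaccard coefficients for a dict mapping matcher name to alignment."""
    names = sorted(alignments)
    return {(a, b): jaccard(alignments[a], alignments[b]) for a in names for b in names}
```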


In Table 1 it can be seen that, despite the various approaches pursued by the
matching systems, most of them arrive at very similar alignments.
One outlier in this statistic is Holontology. This is due to the very low number
of correspondences found overall by this matching system (456, as opposed to
ALIN, which had the second-smallest alignment with 928 matches).

Similarly, the matching systems of the Conference track also show commonalities
in their alignments, although the similarity here is less pronounced than on
the Anatomy track: The median similarity (excluding perfect similarities due to
self-comparisons) of matching systems for Anatomy is $median_{Anatomy} = 0.7223$
whereas the median similarity for Conference is $median_{Conference} = 0.5917$.
The lower matcher similarity median suggests that Conference is a harder
matching task because the matching systems disagree more about certain
correspondences.

In a second step, the same result from the MatcherSimilarityMetric has
been printed by another writer (MatcherSimilarityLatexPlotWriter) which
plots the mean absolute deviation (MAD) on the X-axis and the $F_1$ score on the
Y-axis. MAD was obtained for each matcher by applying

$$MAD = \frac{1}{n} \sum_{i=1}^{n} |x_i - mean(X)|$$


where X is the set of Jaccard similarities for a particular matcher and n is the
number of elements $x_i$ in X. The resulting plots are shown in Figs. 2 and 3.
It can be seen that the matchers form different clusters: Anatomy matchers with
a high $F_1$ measure also have a high deviation.
Consequently, those matchers are likely candidates for a combination to achieve
better results. On Conference, on the other hand, good combinations cannot
be derived because the best matchers measured by their $F_1$ score tend not to
deviate much in their resulting alignments.
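
The MAD computation for a single matcher can be sketched in Python as follows. The input is assumed to be the list of Jaccard similarities of that matcher to the other matchers; whether self-comparisons are part of this list is not stated in the text and is left to the caller.

```python
from statistics import mean


def mean_absolute_deviation(similarities):
    """Average absolute distance of the similarity values from their mean."""
    center = mean(similarities)
    return sum(abs(x - center) for x in similarities) / len(similarities)
```

A matcher whose similarities to all other matchers are nearly identical obtains a MAD close to zero, whereas a matcher that agrees strongly with some systems and weakly with others obtains a high MAD.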

In addition to the evaluations performed using the matcher similarity metric, 
the EvaluatorCSV was run using the OAEI 2018 matchers on the Anatomy and 
Conference tracks. The resulting CSV file contains one row for each correspon- 
dence together with additional information about each resource that is mapped 


29 http://oaei.ontologymatching.org/2018/results/conference/.
30 The code to run the analysis can be found on GitHub:
https://github.com/dwslab/melt/tree/master/examples/analyzingMatcherSimilarity.
Table 1. OAEI anatomy 2018 alignment similarity

[Heat map of pairwise Jaccard similarities between the matchers ALIN, ALOD2Vec,
AML, DOME, FCAMapX, Holontology, KEPLER, Lily, LogMap, LogMapBio, LogMapLt,
POMAP++, SANOM, and XMap; the cell values are not recoverable from the
extracted text.]

Table 2. [Heat map of pairwise Jaccard similarities between the matchers on the
Conference track; the cell values are not recoverable from the extracted text.]

(e.g. label, comment, or type) and with additional information about the
correspondence itself (e.g. residual match indicator or evaluation result). All
files are available online for further analysis on the correspondence level.
Fig. 2. Matcher comparison using MAD and $F_1$ on the Anatomy data set

Fig. 3. Matcher comparison using MAD and $F_1$ on the Conference data set


5 Conclusion 


With MELT, we have presented a framework for ontology matcher development, 
configuration, packaging, and evaluation. We hope to lower the entrance barriers 
into the ontology matching community by offering a streamlined development 
process. Through its rich evaluation capabilities, MELT can also simplify the
work of researchers who evaluate multiple matchers on multiple data sets, such
as OAEI track organizers.

The evaluation capabilities were exemplarily demonstrated for two OAEI 
tracks by providing a novel view on matcher similarity. The MELT framework 
as well as the code used for the analyses presented in this paper are open-source 
and freely available. 

Future work will focus on adding more evaluation possibilities in the form of 
further refiners and reasoners, providing more default matching functionalities 
such as modular matchers that can be used in matching pipelines, and developing 
visual evaluation support based on the framework to allow for a better
comparison of ontology matcher results.
Abstract. Biomedical knowledge graphs such as STITCH, SIDER, and
Drugbank provide the basis for the discovery of associations between 
biomedical entities, e.g., interactions between drugs and targets. Link 
prediction is a paramount task and represents a building block for sup- 
porting knowledge discovery. Although several approaches have been pro- 
posed for effectively predicting links, the role of semantics has not been 
studied in depth. In this work, we tackle the problem of discovering inter- 
actions between drugs and targets, and propose SimTransE, a machine 
learning-based approach that solves this problem effectively. SimTransE 
relies on translating embeddings to model drug-target interactions and 
values of similarity across them. Grounded on the vectorial represen- 
tation of drug-target interactions, SimTransE is able to discover novel 
drug-target interactions. We empirically study SimTransE using
state-of-the-art benchmarks and approaches. Experimental results suggest that
SimTransE is competitive with the state of the art and thus represents an
effective alternative for knowledge discovery in the biomedical domain.


Keywords: Knowledge graphs · Embeddings · Similarity function


1 Introduction 


The discovery of interactions among entities is one of the main link prediction
tasks over knowledge graphs. Specifically, the problem of drug-target interaction
discovery, i.e., identifying proteins that are targets of drugs, is a crucial
task, given that bringing a new drug to the market costs on average
approximately $1.8 billion and takes more than 10 years. Several approaches
have been defined to tackle the
Fig. 1. The Architecture. SimTransE receives an RDF knowledge graph and
similarities among its entities. The output is a set of predicted interactions.


problem of drug-target interaction discovery (e.g., [2,4]). Albeit effective,
existing approaches are not able to exploit the semantics encoded in the main
features of the drugs or targets to enhance prediction. We present SimTransE, an
approach that exploits both similarities between entities, e.g., drugs and
targets, and their connections in a knowledge graph. These features are
considered by SimTransE to represent entities in a vector space. SimTransE is
based on TransE, which utilizes the gradient descent optimization method to
learn the embeddings based on relations stated in a knowledge graph. Similarly,
SimTransE optimizes the distance between embeddings, considering the existing
interactions between drugs and targets, but additionally, SimTransE takes into
consideration domain similarity values between drugs and between targets.
Embeddings generated by SimTransE are utilized to predict new interactions by
applying the homophily principle.¹ We conduct an empirical evaluation to assess
the quality of SimTransE with respect to TransE and a benchmark of interactions
between drugs and targets. Our observed results suggest that considering
similarity empowers SimTransE and allows for the discovery of interactions
between drugs and targets that could not be identified by the baseline version
of TransE.


2 The SimTransE Approach 


After reviewing different approaches such as [2,4], we realize the benefits that
integrating the entity-entity similarity (e.g., target-target, drug-drug, and
target-drug) into a learning model can bring. The intuition behind this work is
that vector embedding-based approaches effectively combine different dimensions
of the input data to learn embeddings. As a result, embeddings merge different
dimensions of the data, giving a multi-dimensional entity representation. We
present SimTransE, an approach that maps each entity into a multi-dimensional
vector space considering entity-entity similarities to improve the results of
the link prediction task. Thus, SimTransE is a vector embedding-based machine
learning model that learns the interactions of a bipartite graph and predicts
unknown interactions.


2.1 Architecture 


The SimTransE architecture comprises a pipeline with three main components. 
Figure 1 shows the interaction between these components and the data flowing 


1 https://en.wikipedia.org/wiki/Homophily.
among them. The Data Processor receives an RDF graph and creates dictionaries
and matrices understandable by SimTransE. First, three sets of entity
dictionaries are created, i.e., left entities (the subjects), right entities
(the objects), and relational entities. These dictionaries are used throughout
the pipeline to create vector embeddings. Secondly, two different sets of binary
sparse matrices are created, representing the positive and the negative
interactions of entities.
Lastly, similarity matrices are built, i.e., given m left entities and n right
entities, we prepare two square matrices (m × m and n × n) that keep the
similarity scores among the left entities and among the right entities,
respectively. The Model Trainer component
receives as input the entity and interaction dictionaries and similarity
matrices. The Model Trainer resorts to the stochastic gradient descent method
to optimize the position and direction of the embeddings in a vector space. The
Model Trainer uses interactions and similarities between entities to solve the
optimization problem, and generates embeddings as output (Table 1 shows the
SimTransE interaction and objective functions). The Predictor component takes
the generated embedding vectors, interactions, and thresholds. Using the
embeddings and thresholds, this component iterates over all the entities and
identifies interactions of each entity with every other entity. The Predictor
component calculates the precision and recall. Additionally, the Area Under the
Receiver Operating Characteristic Curve (AUC) and the Area Under the
Precision-Recall Curve (AUPRC) are calculated.


2.2 Learning Vector Embeddings 


State-of-the-art approaches use only connectivity patterns between entities to
learn the embeddings and perform predictions. Using just interactions among
entities is not enough for real-world applications where domain-specific
knowledge plays a relevant role (e.g., during the prediction of drug-target
interactions [8]). There are very few known interactions, so the positive and
negative classes are highly imbalanced, which impacts the accuracy of the
predictions. To tackle the problem of the imbalanced ratio of positive to
negative classes, SimTransE incorporates not only entity interactions but also
similarities between entities during the learning process. SimTransE creates
duplicate positive classes and adds a set of positive examples, which are
generated using the similarity matrices. The similarity score is considered as
the weight of the example in the learning process.

SimTransE² analyses the interactions and similarities between entities to
learn the embeddings. SimTransE is based on the work “Translating Embeddings
for Modelling Multi-relational Data” (TransE [1]). The intuition of SimTransE
relies on the basic idea of TransE, i.e., if two entities interact with each
other, then the sum of the first entity vector and the relation vector should be
approximately equal to the second entity vector. If there is no interaction
between the two entities, the sum of the first entity vector and the relation
vector will be far from the second entity vector.
Using the same principle, SimTransE locates vectors using the similarities as
well, and adds a new condition in the learning model that states that similar
entities

2 Algorithms are documented in our repository
https://github.com/RDF-Molecules/SimTransE.
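Table 1. SimTransE interaction and objective functions to learn embeddings

Interaction functions:

$$\begin{cases} h + \ell \approx t, & \text{if } h \text{ interacts with } t \\ h + \ell \not\approx t, & \text{otherwise} \end{cases}$$

$$\begin{cases} h_1 + \ell \approx h_2, & \text{if } h_1 \text{ is similar to } h_2 \\ h_1 + \ell \not\approx h_2, & \text{otherwise} \end{cases}$$

Objective functions:

$$L_1 = \sum_{(h,\ell,t) \in S} \; \sum_{(h',\ell,t') \in S'} [\gamma + d(h + \ell, t) - d(h' + \ell, t')]_+ \quad (1)$$

$$L_2 = \sum_{(h_1,h_2) \in S} \; \sum_{(h',\ell,t') \in S'} [d(h + \ell, t) - d(h' + \ell, t')] \quad (2)$$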


should be closer than the dissimilar ones. Interactions are generated based on
the homophily principle, which states that similar entities tend to interact
with similar entities. Further, we rely on thresholds derived from the meaning
of the similarity metrics to decide when two entities can be considered similar.
Then the stochastic gradient descent optimization method is performed; a
mini-batch of drug-target interactions is generated according to a training set S
of interactions. The embeddings are updated during the learning process with
two objective functions: (1) $L_1$ penalizes cases in which the distance of an
actual triple is greater than that of a corrupted triple with respect to the
relation among them; and (2) $L_2$ minimizes the distance according to the
similarity between the actual and self-generated similar triples. The learning
process stops when reaching the total number of epochs, or depending on a
threshold on the distance between the generated embeddings and the training set.
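
The translation idea and the similarity-based example generation can be illustrated with a small, dependency-free Python sketch. It is not the SimTransE implementation from the authors' repository: the Euclidean distance, the margin value, the function names, and the rule for deriving weighted examples from drug-drug similarities are assumptions chosen to mirror the description above.

```python
import math


def distance(head, relation, tail):
    """Euclidean distance d(h + l, t) between translated head and tail."""
    return math.sqrt(sum((h + l - t) ** 2 for h, l, t in zip(head, relation, tail)))


def margin_loss(positive, corrupted, margin=1.0):
    """TransE-style hinge loss for one actual and one corrupted triple."""
    return max(0.0, margin + distance(*positive) - distance(*corrupted))


def similarity_examples(interactions, drug_similarity, threshold):
    """Assumed homophily rule: a drug similar to an interacting drug
    yields an extra positive example weighted by the similarity score."""
    examples = []
    for drug, target in interactions:
        for (a, b), score in drug_similarity.items():
            if a == drug and score >= threshold:
                examples.append((b, target, score))
    return examples
```

The loss is zero once the corrupted triple is farther away than the actual triple by at least the margin, so only violating pairs contribute to the gradient update.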


2.3 Predicting Links 


The fundamental task of link prediction is to identify relations between two
entities. Yang et al. [9] define link prediction formally as a task in a network
$G = (V, E)$ where V is the set of nodes and E is the set of edges. The main
challenge in this task is to predict whether there is or will be a
link $e(u,v)$ between a pair of nodes $u, v \in V$ with $e(u,v) \notin E$. To
perform link prediction, SimTransE uses the trained vector embeddings and
calculates the distance of each entity to every other entity with respect to the
relation between them. Based on this calculated distance and a given threshold,
SimTransE decides whether the input entities are related. SimTransE ranks each
entity on the basis of distance and assigns a probability by comparing it with the
distance of other entities. If this probability is greater than the given threshold,
then SimTransE includes the link in the output.
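
The ranking and thresholding step can be sketched in Python. How SimTransE converts distances into probabilities is not specified in the text; the sketch below assumes a softmax over negative distances purely for illustration, and the function name and data layout are hypothetical.

```python
import math


def predict_links(head, relation, candidates, threshold):
    """Rank candidate tails by distance and keep those above the threshold."""
    translated = [h + l for h, l in zip(head, relation)]
    distances = {name: math.dist(translated, vector) for name, vector in candidates.items()}
    # Assumed scoring: softmax over negative distances, so closer means more probable.
    weights = {name: math.exp(-d) for name, d in distances.items()}
    total = sum(weights.values())
    scores = {name: weight / total for name, weight in weights.items()}
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return [(name, score) for name, score in ranked if score > threshold]
```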

To evaluate link prediction we measure: Precision, the ratio of correctly
predicted interactions to all predicted interactions; Recall, the ratio of
correctly predicted interactions to all expected interactions; Area under the
Precision-Recall Curve, which does not consider true negatives since neither
precision nor recall considers them; and, finally, Area under the ROC
Curve, which we use because it works best when the dataset exhibits imbalanced
classes [9] (Fig. 2).
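
As an illustration of the last metric, the area under the ROC curve can be computed from its rank interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following Python sketch is a generic formulation and not the formula documented in the authors' repository.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Requires at least one positive (label 1) and one negative (label 0) example.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

The pairwise formulation is quadratic in the number of examples, which is acceptable for a sketch; sorting-based implementations are preferable for large prediction sets.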
3 Empirical Evaluation


We empirically study the effectiveness of SimTransE on the problem of predicting
links. We assess the following research questions: (RQ1) Is SimTransE able
to perform as well as the state-of-the-art similarity measures? (RQ2) Does
SimTransE perform well on the task of link prediction when applied to data
with many connections? To answer these questions, we evaluate SimTransE on
a state-of-the-art benchmark of drug-target interactions [8]; we report only on the
results of interactions between drugs and targets of the type Nuclear Receptor
and Ion Channel. TransE [1] is the baseline of the experiment. Additionally,
we utilize the link prediction technique SemEP [4], which extracts interactions
from highly connected partitions of a knowledge graph; these interactions are
utilized to enhance the set of input interactions. Furthermore, we compute the
drug-drug and target-target similarity matrices; drug similarities are computed
using SIMCOMP [3] while target similarities are computed using a normalised
Smith-Waterman score [5].


Fig. 2. SimTransE exhibits good performance in both datasets. Area under ROC
Curve for simTransE, simTransE+semEP, semEP, random, and transE: (a) AUC results
on Ion Channel; (b) AUC results on Nuclear Receptor.


Results and Discussion: From the output of SimTransE, we calculated true
and false positives and true and false negatives. From these values, we derived
Precision, Recall, AUC, and AUPR³. We apply a blocking method on the generated
similarity-based interactions through percentiles, i.e., four percentiles
are considered: 80, 90, 95, and 100. Link prediction is validated following
10-fold cross-validation, and we report the mean across the results of the ten
folds. Based on the observed outcomes, we can positively answer RQ1, i.e.,
SimTransE performs well on all the datasets and outperforms the baseline method
TransE in all cases. These results suggest that similarities between entities,
e.g., drugs and targets, have a positive impact on both the learning process and
the link prediction tasks. We observe, as well, that by increasing the number of
connections between drugs and targets (e.g., by using SemEP results) the
effectiveness

3 Source code and formulas to calculate Precision, Recall, AUC, and AUPR are
documented in our repository https://github.com/RDF-Molecules/SimTransE.
of the approach improves even further. A few interactions are not predicted
properly although they are present in the training set. For most of them, we
find that drugs and targets with few interactions are difficult for SimTransE
to learn. This situation is improved after using the interactions predicted
from SemEP. Therefore, RQ2 is positively answered too.


4 Conclusions 


In this paper, we presented SimTransE, a method to analyze interactions in 
knowledge graphs to predict links, based on the vectorization of the entities. To 
learn the embeddings, SimTransE uses not only the interactions among entities 
but also values of similarity between them. To test the accuracy of SimTransE, we 
compared its results against TransE, a prediction model for translational
embeddings that uses only interactions among entities. SimTransE exhibited high
accuracy and competitive results and outperformed TransE, one of the
state-of-the-art approaches. The observed results suggest that combining
interaction and similarity related semantics in the embeddings empowers the
prediction model over knowledge graphs. In future work, we plan to conduct a
more exhaustive evaluation to guarantee the reproducibility of the results, as
well as a comparison with other embedding creation models, e.g., TransH [6] and
TransG [7].
Abstract. Information retrieval is regarded as pivotal to empower lay
users to access the Web of Data. Over the past years, it has gained momentum
with a large number of approaches being developed for different
scenarios such as entity retrieval, question answering, and entity linking.
This work addresses the problem of entity retrieval over RDF
knowledge graphs using keyword factual queries. It presents an approach
that incorporates keyword graph structure dependencies through
a conditional spread activation. Experimental evaluation on standard
benchmarks demonstrates that the proposed method can improve the
performance of current state-of-the-art entity retrieval approaches
reasonably.


1 Introduction 


Over the last years, a wealth of information has been published as structured
data. The Resource Description Framework (RDF)¹ became a standard format for
many publicly available knowledge graphs (KGs) such as DBpedia [18] and
Wikidata [26]. An RDF KG organizes information in the form of
subject-predicate-object statements expressing semantic relations between
entities (e.g. persons, organizations, and places) and concepts (e.g. given
names, addresses, and locations). Currently, approximately 10,000 RDF KGs are
available via public data portals.² Together, these graphs compose the so-called
Linked Open Data Cloud (LOD).

Lately, approaches designed to retrieve or use KG information have been
receiving substantial attention. Some of these approaches are Entity Retrieval
(ER), Entity Linking (EL), Entity Disambiguation (ED), and Question Answering
(QA). ER is a category of information retrieval (IR) in which the result of a
natural language search query is an entity or an entity’s property rather than
a document. ER methods play a fundamental role in IR on KGs. They enable lay
users to access KG information and support other approaches in performing EL
[7], ED [24,36], and QA [10,33,35] tasks. Improving ER methods can have a
substantial impact on the whole IR chain.

1 http://www.w3.org/RDF.
2 http://lodstats.aksw.org/.

ER on RDF KG has peculiar characteristics that make it stand apart from 
standard document retrieval. The information in KG is structured in entities, 
attributes, classes, and their relationships. Exploring this structure makes ER a 
thriving research topic. Early approaches applied bag-of-words document
retrieval techniques [4,8,38]. Research then shifted to exploring the relations
between KG entities and concepts in fields, giving rise to field retrieval
models [5]. Later studies focus on evaluating the influence of word sequence
and property type [2,21,41]. Recently, the use of EL has been considered for
improving ER [14].

This work presents CACAO, a novel approach for ER on large³ and diverse RDF
KGs. It relies on a novel spread activation (SA) method to improve information
access. SA is a method that iteratively propagates weights in a graph from one
node to another [6]. CACAO differs from previous approaches by evaluating the
query’s intent on entities and concepts rather than fields, and by avoiding
keyword over- and under-relatedness estimation by accounting only for the
highly activated ones. The evaluation of the approach on two standard
benchmarks shows an F-measure improvement of ~10%.

The remainder of this work is organized as follows. Section 2 defines RDF
KG and states the problem. Section 3 describes the conditional spread activation
model entitled CACAO. Section 4 presents the evaluation and discusses the results.
Section 5 provides a literature overview of related work. Finally, Sect. 6
concludes with an outlook on the limitations of the approach and potential
future work.


2 Preliminaries 


An RDF KG can be regarded as a set of triples of the form
$\langle s, p, o \rangle \in (I \cup B) \times P \times (I \cup L \cup B)$
where: $I$ is the set of all IRIs; $B$ is the set of all blank nodes,
$B \cap I = \emptyset$; $P$ is the set of all predicates, $P \subseteq I$; $E$
is the set of all entities, $E = I \cup B \setminus P$; $L$ is the set of all
literals; and $R$ is the set of all resources, $R = I \cup B \cup P \cup L$. In
this graph, an entity type is specified by the property rdf:type while the
label is specified by the property rdfs:label. A field of an entity is a
predicate-object pair $f = \langle p, o \rangle$ belonging to an entity triple
$\langle e, p, o \rangle$. The aim of entity retrieval is to recover the top-K
ranked entities that best address the information need behind a given query as
follows.


Definition 1 (Problem Statement). Formally, a top-K entity retrieval takes
a keyword query $Q$, an integer $k > 0$, a set of entities
$E = \{e_1, e_2, \ldots, e_n\}$, and returns the top-k entities based on a
scoring function $S(Q, e)$.
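
As a minimal sketch of Definition 1 (not the authors' implementation), the following Python snippet derives the fields of an entity from a list of triples and ranks entities under an arbitrary scoring function. The toy triples, the `fields` and `top_k` helpers, and the keyword-overlap `toy_score` are hypothetical stand-ins for the KG and for S(Q, e).

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # <s, p, o>

# Hypothetical toy graph; a real KG such as DBpedia is far larger.
TRIPLES: List[Triple] = [
    ("dbr:Carrot_Cake", "rdfs:label", "carrot cake"),
    ("dbr:Carrot_Cake", "dbo:ingredient", "dbr:Carrot"),
    ("dbr:Carrot", "rdfs:label", "carrot"),
    ("dbr:Cake", "rdfs:label", "cake"),
]


def fields(entity: str, triples: Iterable[Triple]) -> List[Tuple[str, str]]:
    """Return the fields f = <p, o> of all triples <e, p, o> of an entity."""
    return [(p, o) for s, p, o in triples if s == entity]


def toy_score(query: Set[str], entity: str) -> float:
    """Placeholder for S(Q, e): number of query keywords found in the labels."""
    words: Set[str] = set()
    for predicate, obj in fields(entity, TRIPLES):
        if predicate == "rdfs:label":
            words |= set(obj.split())
    return float(len(query & words))


def top_k(query: Set[str], entities: Iterable[str], k: int,
          score: Callable[[Set[str], str], float]) -> List[str]:
    """Return the k entities with the highest score for the query."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return heapq.nlargest(k, entities, key=lambda e: score(query, e))


ENTITIES = sorted({s for s, _, _ in TRIPLES})
best = top_k({"carrot", "cake"}, ENTITIES, 1, toy_score)
```

Any scoring function with the same signature can be passed in place of `toy_score`; the ranking step itself does not depend on how S(Q, e) is computed.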


3 The Approach 


CACAO is an ER approach to facilitate information access using keyword factual
queries in RDF knowledge graphs. Factual queries are those whose intent can
be formalized by simple Basic Graph Patterns (BGP).⁴ Entity retrieval on KGs
has been studied for many years. Early approaches rely on bag-of-words models
[4,8,38] that suffer from unrelatedness [5] and verbosity [29]. They were built
under the assumption that the distribution of keywords is proportional to their
subject relatedness [19]. This idea conflicts with the fact that people describe
things differently: some authors are more descriptive or verbose than others.
Particularly in the case of DBpedia, editors’ experience or knowledge can
unconsciously influence keyword frequency or even graph connectivity. To
address the problem of verbosity, researchers proposed to score keywords
normalized by the information (entity) length [29]. Another generation of ER
approaches focused on the problem of unrelatedness by employing field retrieval
models [5]. Later studies focused on how to weight fields differently so as to
improve ER accuracy [2,21,41]. Nevertheless, field retrieval models are unable
to relate query keywords to a specific predicate or object because both are
treated as one, a bag-of-(field-words). Recent approaches introduced two-stage
techniques employing ER followed by Entity Link Retrieval (ELR) [14].

3 We define large KGs as those having over a billion facts.

CACAO addresses the ER problem in a different manner. It relies on an SA
method that works in three steps. A query triggers an activation function that
measures the relatedness of KG resources w.r.t. the query. The resource
relatedness values are then spread to their connected entities using a
conditional backward propagation and, in a later process, a conditional forward
propagation. The individual resource relatedness measurement addresses the
problem of finding the query’s intent. The conditional propagation avoids the
over- and under-estimation of frequent and rare keywords. The next sections
describe how the (1) Activation, (2) Conditional Backward Propagation and
(3) Conditional Forward Propagation work.


3.1 Activation 


CACAO performs the activation on the resources. It uses the resource label
coverage to evaluate query relatedness. Under this criterion, a query containing
“birth date” should be more related to the property dbo:birthDate than to the
property dbo:deathDate or dbpprop:date, while the query “date” should be more
related to the property dbpprop:date than to dbo:birthDate. Equation 1
formalizes the evaluation of the query label’s coverage. It receives as
parameters the query $Q$ and a resource label $L$, both represented by bit
vectors. In these vectors, keywords are dimensions whose occurrence is either
zero or one.


$$C(Q, L) = \frac{\sum_i Q_i L_i}{\sum_i L_i} \qquad (1)$$


Yet, the equation above cannot be used as an activation function, because it
scores equally all resources with the same query coverage rate. For the sake of
illustration, let us take as an example the query “carrot cake”. For this query,
dbr:Carrot, dbr:Cake and dbr:Carrot_Cake all have the same coverage value of
one, although dbr:Carrot_Cake has two overlapping keywords. Thus, full
label-query overlaps are evaluated as the number of query keywords to the power
of the number of label keywords, $|Q|^{\sum_i L_i}$ with $|Q| = \sum_i Q_i$.
Incomplete overlaps are still considered, but treated with less importance. For
those, the query-label intersection over their union suffices (Eq. 2).
Equation 3 outlines the activation function. Notice two important properties.
First, entities whose resources were activated by mere chance will always score
lower than the query length ($|Q|$). Second, it becomes easier to differentiate
between resources with full and partial query coverage.

$$J(Q, L) = \frac{\sum_i Q_i L_i}{\sum_i Q_i + \sum_i L_i - \sum_i Q_i L_i} \qquad (2)$$

$$A(Q, L) = \begin{cases} J(Q, L) & \text{if } C(Q, L) < 1; \\ |Q|^{\sum_i L_i} & \text{otherwise.} \end{cases} \qquad (3)$$


Fig. 1. Conditional Spread Activation on query “carrot cake ingredients”.

4 For Basic Graph Pattern definition, visit
http://www.w3.org/TR/rdf-sparql-query/#BasicGraphPatterns.
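
The activation described by Eqs. 1 to 3 can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code: keyword sets stand in for the bit vectors, and the full-overlap branch (number of query keywords to the power of the number of label keywords) follows the prose description, so it should be treated as an assumption. The function names are hypothetical.

```python
from typing import Set


def coverage(query: Set[str], label: Set[str]) -> float:
    """Eq. 1: fraction of the label's keywords that occur in the query."""
    return len(query & label) / len(label) if label else 0.0


def jaccard(query: Set[str], label: Set[str]) -> float:
    """Eq. 2: query-label intersection over their union."""
    union = query | label
    return len(query & label) / len(union) if union else 0.0


def activation(query: Set[str], label: Set[str]) -> float:
    """Eq. 3 (as read from the text): Jaccard for partial label coverage,
    |Q| ** |L| when every label keyword occurs in the query."""
    if not label:
        return 0.0
    if coverage(query, label) < 1:
        return jaccard(query, label)
    return float(len(query) ** len(label))


query = {"carrot", "cake"}
carrot = activation(query, {"carrot"})               # full coverage, one keyword
carrot_cake = activation(query, {"carrot", "cake"})  # full coverage, two keywords
carrot_soup = activation(query, {"carrot", "soup"})  # partial coverage
```

Under these assumptions, dbr:Carrot and dbr:Carrot_Cake both have coverage one for the query “carrot cake”, yet the label with two overlapping keywords receives the higher activation, and any partially covered label stays below one.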


3.2 Conditional Backward Propagation 


Backward propagation consists of distributing computed values backward
through a network. It is used in neural networks to transfer the errors
throughout the network’s layers [12]. In CACAO, backward propagation is used to
spread the resources’ activation values to their connected entities. By doing
so, the approach implicitly computes the relatedness of the entity and its
connected resources to the query elements. However, the transfer is
conditioned: only the most activated value of each keyword is propagated. It
spreads from the resource to the fields and, likewise, from the fields to the
entity. This strategy prevents keyword frequency from impacting the activation
value while preserving keyword informativeness. For example, the entity
dbpedia:Aristotle contains both dbo:birthDate and dbo:deathDate. Without this
condition, the keyword “date” would have twice as much impact on
dbpedia:Aristotle as on entities containing solely one of these properties (e.g.
dbo:birthDate, dbo:deathDate or dbpprop:date).

Previous works demonstrate that scoring fields differently can improve the 
ER accuracy [2,21,41]. Hence, CACAO employs field weighting as described by 
Marx et al. [21]. Additionally, a query intent can be one or a set of entities. In 
the latter case, an important feature is the relevance ranking. As an example, 
the query “give me all persons” can return more than one million persons if 
applied to the DBpedia KG. But not all these entities may be relevant to the 
user. To deal with this problem, each activated entity receives a Page-Rank
value normalized to be lower than a keyword weight. This work uses a modified
version of PageRank [32], dubbed DBpedia Page-Rank, which has been shown to
produce better estimations [22].

Algorithm 1 formally describes the computation of the conditional backward
propagation (Fig. 1a). It starts when the function $A^f(\vec{Q}, f, \vec{L^c})$
receives a bit vector representing the query $\vec{Q}$, the field $f$, and the
set of already processed keywords $\vec{L^c}$. The field activation value
($a_f$) is initialized with 0, $R^f_s$ with $\emptyset$, and $R^f$ receives the
field’s resource list. Subsequently, the function iterates over $R^f$,
computing the activation value $a_r$ using the vectorized resource label
returned by the function $V^L(r)$. In line 19, the function INSERT performs an
insertion sort on the list $R^f_s$. The insertion is performed in the ascending
order of the resource’s activation value to ensure that only the highly
activated keywords have their value transferred to the entity. Subsequently, an
iteration operates over the sorted resource list $R^f_s$. The activation $a_r$
is now evaluated over the resource label after removing the keywords that were
computed in previous iterations ($\vec{L^c_r}$). In the last instructions of
the iteration, the resource activation value is added to the field activation
$a_f$ (line 25), and the resource keywords are added to the computed keyword
list $\vec{L^c}$ (line 26). The function ends by adding the field’s weight
$\phi(f)$ to the final activation value $a_f$ (line 28). Notice that we did not
discuss stop word removal or tokenization when describing the algorithm because
they are optional and do not influence the overall computation.

The entity activation is computed over the fields’ activations as follows. The
function $A^e(\vec{Q}, e)$ receives a vectorized query $\vec{Q}$ and an entity
$e$. The entity activation value $a_e$ is initialized with 0. The computed
keywords $\vec{L^c}$ and the sorted field set $F^e_s$ receive $\emptyset$. The
field set $F^e$ receives the list of entity fields. Similar to the field
activation function $A^f(\vec{Q}, f, \vec{L^c})$, the entity activation
consists of two iterations. The first (line 3) computes the field activation
value $a_f$ on every field’s keyword, and uses an insertion sort function
(line 6) to add the fields to $F^e_s$ according to their inverse activation
value. In this iteration, the computed keywords parameter $\vec{L^c}$ of the
field activation function receives an empty set (line 5), allowing it to
compute the activation on every keyword. The function then iterates over the
sorted fields $F^e_s$ (line 8), discarding the computed keywords and
transferring the field’s activation value to the entity, $a_e$. Finally, the
activation value receives a normalized Page-Rank value returned by the
$\omega(e)$ function.
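
The two functions of Algorithm 1 can be sketched in Python as below. This is a simplified illustration under several assumptions that are not confirmed by the source: keyword sets replace bit vectors, each field is a hypothetical pair of (list of resource labels, field weight), resources and fields are processed from the most to the least activated, and the activation function follows the reading of Eq. 3 given in the text.

```python
from typing import List, Set, Tuple

Field = Tuple[List[Set[str]], float]  # (resource labels in the field, weight)


def activation(query: Set[str], label: Set[str]) -> float:
    """Assumed reading of Eq. 3: |Q| ** |L| on full label coverage, else Jaccard."""
    overlap = query & label
    if not label or not overlap:
        return 0.0
    if label <= query:
        return float(len(query) ** len(label))
    return len(overlap) / len(query | label)


def field_activation(query: Set[str], resources: List[Set[str]],
                     computed: Set[str], weight: float = 0.0) -> float:
    """Sketch of A^f: only keywords not yet computed contribute to the field."""
    # First pass (lines 16-20): order resources by activation, highest first.
    ranked = sorted(resources, key=lambda lbl: activation(query, lbl), reverse=True)
    a_f = 0.0
    for label in ranked:                      # second pass (lines 21-27)
        remaining = label - computed          # drop already computed keywords
        a_f += activation(query, remaining)
        computed |= remaining                 # mark keywords as computed (in place)
    return a_f + weight                       # add the field weight (line 28)


def entity_activation(query: Set[str], fields: List[Field],
                      page_rank: float = 0.0) -> float:
    """Sketch of A^e: fields are ranked independently, then summed conditionally."""
    ranked = sorted(
        fields,
        key=lambda f: field_activation(query, f[0], set(), f[1]),
        reverse=True,
    )
    computed: Set[str] = set()
    a_e = 0.0
    for resources, weight in ranked:
        a_e += field_activation(query, resources, computed, weight)
    return a_e + page_rank                    # normalized Page-Rank term


# Hypothetical entities: one with birth and death dates, one with a birth date only.
two_dates: List[Field] = [([{"birth", "date"}], 0.0), ([{"death", "date"}], 0.0)]
one_date: List[Field] = [([{"birth", "date"}], 0.0)]
```

With the query “date”, both hypothetical entities receive the same activation in this sketch, which mirrors the dbpedia:Aristotle example: the keyword is counted once even though it occurs in two properties.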


3.3 Conditional Forward Propagation 


The forward propagation is only applied when a property contributes to the
field’s activation. It forwards the entity activation to its activated
properties, and from them to their objects. It results in objects having a
higher activation value than their associated entity. Let us suppose that a
user is looking for “carrot cake ingredients”. In the case of
dbpedia:Carrot_Cake, the label activation will be backward propagated to the
entity and then forwarded to the object of the dbo:ingredient field together
with the property activation. Thus, the dbo:ingredient object in the BGP
<dbpedia:Carrot_Cake dbo:ingredient ?object> is going to have a higher
activation value than dbpedia:Carrot_Cake. Fig. 1b shows the conditional
forward propagation for our running example query “carrot cake ingredients”.
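
A minimal Python sketch of the forward step is shown below. The text only states that the entity activation is forwarded together with the property activation and that objects end up with a higher value than the entity; combining the two values by addition is an assumption made here for illustration, and the function name, the triples, and the numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def forward_propagate(entity: str, entity_activation: float,
                      property_activations: Dict[str, float],
                      triples: List[Triple]) -> Dict[str, float]:
    """Forward the entity activation to the objects of its activated properties.

    Assumption: object activation = entity activation + property activation.
    Properties that did not contribute to the activation are skipped.
    """
    objects: Dict[str, float] = {}
    for subject, predicate, obj in triples:
        a_p = property_activations.get(predicate, 0.0)
        if subject == entity and a_p > 0:
            objects[obj] = max(objects.get(obj, 0.0), entity_activation + a_p)
    return objects


# Hypothetical excerpt of the running example "carrot cake ingredients".
triples = [
    ("dbpedia:Carrot_Cake", "dbo:ingredient", "dbpedia:Carrot"),
    ("dbpedia:Carrot_Cake", "dbo:ingredient", "dbpedia:Flour"),
    ("dbpedia:Carrot_Cake", "rdf:type", "dbo:Food"),
]
result = forward_propagate(
    "dbpedia:Carrot_Cake", 4.0, {"dbo:ingredient": 1.0}, triples
)
```

Only the objects of dbo:ingredient are activated in this sketch, and each of them ends up above the activation of dbpedia:Carrot_Cake itself, as described for the BGP in the text.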


4 Evaluation 


The evaluation was designed to measure the accuracy of CACAO compared to
other ER and Entity Linking methods. All output generated by the systems
is publicly available at https://github.com/AKSW/irbench. There are several
benchmark data sets that could be used for this task, including benchmarks
from Semantic Search initiatives [13]⁵ and QA Over Linked Data (QALD).⁶
Semantic Search is based on user queries extracted from the YAHOO! search log,
containing an average of 2.2 words per query. QALD provides both QA and keyword
search benchmarks for RDF data. The QALD data sets are the most suitable due to
the wide variety of queries they contain and also because they make use of
DBpedia, a very large and diverse KG. In this work, we use the QALD version 2
(QALD-2) data set benchmark from the Test Collection for Entity Search
(DBpedia-Entity) [1] and QALD version 4 (QALD-4) [34]. Table 1 shows the number
of queries evaluated on each of them.


4.1 Experimental Setup 


The evaluation contains two setups. The first setup evaluates CACAO against
state-of-the-art Entity Retrieval (ER) approaches using QALD-2 from
DBpedia-Entity. The second setup evaluates CACAO against state-of-the-art ER
and Entity Linking Retrieval (ELR) approaches for RDF data using QALD-4. Both
setups evaluate the approach with (CACAO+F) and without (CACAO) forward
propagation.


5 http://km.aifb.kit.edu/ws/semsearch10/.
6 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/.

Table 1. Number of queries evaluated on each of the benchmark data sets.


Benchmark | #Queries
QALD-2 (DBpedia-Entity) | 140
QALD-4 | 50


Data: Q, the query. e, the entity.
Result: The entity final activation value a_e.

 1 Function A^e(Q, e):
 2     a_e ← 0; L^c ← ∅; F^e_s ← ∅; F^e ← the fields in <e, p, o> triples;
 3     forall f ∈ F^e do
 4         a_f ← A^f(Q, f, L^c);
 5         L^c ← ∅;
 6         INSERT(F^e_s, f, a_f);
 7     end
 8     forall f ∈ F^e_s do
 9         a_e ← a_e + A^f(Q, f, L^c);
10     end
11     a_e ← a_e + ω(e);
12     return a_e;
13 end


Data: f, the field. Q, the query. L^c, the vector containing the already
computed keywords.
Result: The field score a_f.

14 Function A^f(Q, f, L^c):
15     a_f ← 0; R^f_s ← ∅; R^f ← the list of resources in f;
16     forall r ∈ R^f do
17         L_r ← V^L(r);
18         a_r ← A(Q, L_r);
19         INSERT(R^f_s, r, a_r);
20     end
21     forall r ∈ R^f_s do
22         L_r ← V^L(r);
23         L^c_r ← L_r ∧ ¬L^c;
24         a_r ← A(Q, L^c_r);
25         a_f ← a_f + a_r;
26         L^c ← L^c ∨ L^c_r;
27     end
28     a_f ← a_f + φ(f);
29     return a_f;
30 end
Algorithm 1. A Conditional Backward Propagation.

First Setup. The first setup evaluates CACAO against thirteen different ER
models distributed over three groups (Unstructured, Fielded and Other models)
using the QALD-2 DBpedia-Entity data set benchmark. Results are reported using
the benchmark standard evaluation metrics: Mean Average Precision (MAP) and
Precision at rank 10 (P@10) [20]. The evaluated unstructured retrieval models
use a flattened entity representation: LM (Language Modeling) [27]; SDM
(Sequential Dependence Model) [23]; and BM25 [29]. Five retrieval models employ
a fielded entity representation: MLM (Mixture of Language Models) [25]; FSDM
(Fielded Sequential Dependence Model) [41]; BM25F [5]; MLM-all, with equal
field weights; and PRMS (Probabilistic Model for Semistructured Data) [15]. The
LTR (Learning-to-Rank) approach [3] employs 25 features from various retrieval
models trained using the RankSVM algorithm. All EL (Entity Link) methods used
TAGME [11] for annotating queries with entities, and a URI-only index (with
a single catchall field) for computing the EL component. CA suffixes refer to
models that are trained using Coordinate Ascent.


Second Setup. The second setup extends the evaluation to the QALD-4 benchmark
on ER and EL. It measures the performance of eight different Levenshtein,
Jaccard, BM25F and CACAO baseline scoring functions: Levenshtein_a uses the
number of matched characters for each matched keyword; a second Levenshtein
variant uses the number of matched characters with the paraphrase
disambiguation method proposed by Zhang et al. [40]; Jaccard_a uses the Jaccard
distance of matched resources per matched keyword; a second Jaccard variant
uses the disambiguation model implemented by Shekarpour et al. [31]; BM25F is
the ER method proposed by Blanco et al. [2]; and CACAO_P65 is the CACAO
disambiguation model with the 65 rule proposed by Shekarpour et al. [31]⁷,
$J(Q, L) \in [0.65, 1]$, applied only to properties. The idea is that there is
a need to address inflections only on properties, where verbs occur, rather
than on objects, which usually contain proper names. The Levenshtein_a and
Jaccard_a methods are used to measure local keyword frequency without global
occurrence normalization. CACAO and Glimmer_Y! [2] performed all queries in OR
mode. The performance considers only the top-K entries returned by each
approach, where K equals the number of entries in the target test query. The EL
evaluation in Table 5 evaluates the mentioned baseline functions as well as the
latest version of DBpedia Spotlight (version 1.0), AGDISTIS [36] and the
state-of-the-art ED approach MAG [24] on simple BGP queries. This evaluation
was designed to measure how accurate CACAO can be compared to approaches that
use EL on factual keyword queries. We discard queries that can only be answered
using classes and properties. We avoid the use of these queries because
annotators usually can only handle entities. QALD-4 has ten queries that meet
this criterion: Queries 12, 13, 21, 26, 30, 32, 34, 41, 42, and 44. All queries
evaluated over DBpedia Spotlight used a refinement operator approach starting
from confidence 0.5 and decreasing in steps of 0.05 until reaching an
annotation, when possible, or zero. AGDISTIS [36] and MAG [24] were evaluated
over manually marked entity queries.
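
The refinement operator used for DBpedia Spotlight can be sketched as a simple loop. The snippet below is illustrative only: `annotate` is a hypothetical callable standing in for a call to an annotation service, and trying confidence zero as the final step is an assumption about how "until reaching an annotation or zero" is applied.

```python
from typing import Callable, List, Optional, Tuple

Annotator = Callable[[str, float], List[str]]


def annotate_with_refinement(query: str, annotate: Annotator,
                             start: float = 0.5, step: float = 0.05
                             ) -> Tuple[Optional[float], List[str]]:
    """Lower the confidence threshold until the annotator returns something.

    Confidence is tracked in integer hundredths to avoid floating-point drift.
    Returns (confidence used, annotations) or (None, []) if nothing was found.
    """
    confidence = round(start * 100)
    decrement = round(step * 100)
    if decrement <= 0:
        raise ValueError("step must be positive")
    while confidence >= 0:
        annotations = annotate(query, confidence / 100)
        if annotations:
            return confidence / 100, annotations
        confidence -= decrement
    return None, []


def fake_annotator(query: str, confidence: float) -> List[str]:
    """Hypothetical annotator that only answers at confidence 0.45 or below."""
    return ["dbpedia:Pope"] if confidence <= 0.45 else []


used_confidence, entities = annotate_with_refinement("pope", fake_annotator)
```

The stand-in annotator mimics the behaviour reported for the keyword “pope”, where no annotation is produced at confidence 0.5 but one is returned at 0.45.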


7 The optimal value is a range belonging to the interval (0.6, 0.7].

Table 2. Mean Average Precision (MAP) achieved by different Entity Retrieval
models on QALD-2 DBpedia-Entity benchmark data set.

Approach | MAP
CACAO | 0.2417
BM25-CA | 0.1939
SDM+EL | 0.1887
FSDM+EL | 0.1719
BM25F-CA | 0.1671
LTR | 0.1629
LM+EL | 0.1534
SDM | 0.1533
LM | 0.1424
FSDM | 0.1403
MLM-CA | 0.1273
PRMS | 0.1103
BM25 | 0.1092
MLM-all | 0.1062


Table 3. Precision at rank 10 (P@10) achieved by different Entity Retrieval
models on QALD-2 DBpedia-Entity benchmark data set.

Approach | P@10
CACAO | 0.3057
BM25-CA | 0.2527
LM+EL | 0.2362
SDM+EL | 0.2249
LM | 0.2144
FSDM+EL | 0.2113
BM25F-CA | 0.2053
FSDM | 0.2000
SDM | 0.1883
PRMS | 0.1871
MLM-CA | 0.1844
MLM-all | 0.1843
LTR | 0.1732
BM25 | 0.0986


Table 4. Precision, Recall and F1-measure achieved by different ER approaches
on QALD-4 benchmark data set.

Approach | P | R | F1
CACAO+F | 0.19 | 0.19 | 0.19
CACAO | 0.11 | 0.11 | 0.11
CACAO_P65 | 0.09 | 0.09 | 0.09
Levenshtein (with disambiguation) | 0.04 | 0.05 | 0.04
BM25F [2] | 0.03 | 0.03 | 0.03
Jaccard (with disambiguation) | 0.01 | 0.04 | 0.01
Levenshtein_a | 0.00 | 0.00 | 0.00
Jaccard_a | 0.00 | 0.00 | 0.00


Table 5. Precision, Recall and F1-measure achieved by different EL approaches
on QALD-4 benchmark data set.

Approach | P | R | F1
CACAO_P65 | 1 | 1 | 1
CACAO | 0.90 | 0.90 | 0.90
MAG [24] | 0.80 | 0.80 | 0.80
DBpedia Spotlight [7] | 0.70 | 0.70 | 0.70
Levenshtein (with disambiguation) | 0.60 | 0.60 | 0.60
Jaccard (with disambiguation) | 0.60 | 0.60 | 0.60
AGDISTIS [36] | 0.30 | 0.30 | 0.30
BM25F [2] | 0.30 | 0.30 | 0.30
Levenshtein_a | 0.00 | 0.00 | 0.00
Jaccard_a | 0.00 | 0.00 | 0.00


Query and Resource Parsing. All implemented models (CACAO, CACAO+F,
Jaccard and Levenshtein) perform query and resource parsing by extracting
individual keywords, removing punctuation and capitalization, and applying
lemmatization.


4.2 Results 


The results show that CACAO outperforms the state-of-the-art in both ER and
EL tasks with keyword factual queries. It achieved about 10% more accuracy than
ER and EL approaches. Further, as expected, annotators performed better than ER
on the EL task. Tables 2 and 3 show, respectively, the MAP and P@10 performance
of CACAO compared to 13 methods. The tables report the scores with a precision
of four digits. It is possible to notice that MAP scores are considerably lower
than P@10 scores. That occurs because MAP is calculated on the average entry’s
precision per question while P is computed only over matching entries. It means
that although the entities are retrieved, their query rank can still be
improved. Except for CACAO, some methods achieved different positions in P@10
and MAP. The outcomes reveal that CACAO could produce more (1) precise and
(2) complete results. In general, except for SDM, the results confirm previous
findings [14] showing that CA and EL approaches could achieve better
performance than their plain versions without them, while EL-based methods
performed better than CA ones. CACAO could outperform previous methods because
it acts as a resource linking approach. It evaluates resource dependencies
rather than the bigram and trigram keyword dependencies used in fielded
approaches. It also overcomes the SDM weakness in sorting entities in relevance
order [41] by using Page-Rank.
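
To make the difference between the two metrics concrete, the sketch below implements the textbook definitions of P@k and (mean) average precision in Python. It is not the benchmark's evaluation script, whose exact variant is not stated here, and the example rankings are invented for illustration.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for entity in ranked[:k] if entity in relevant) / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant result."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, entity in enumerate(ranked, start=1):
        if entity in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the per-query average precision values."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


relevant = {"a", "b"}
well_ranked = ["a", "b", "x"]    # relevant entities first
badly_ranked = ["x", "a", "b"]   # same entities, worse order
```

Both invented rankings contain the same relevant entities, so their P@3 is identical, while average precision drops for the second one. This is the rank sensitivity referred to above: entities can be retrieved and still be poorly ordered.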

Table 4 shows the Precision, Recall and F-measure achieved by each baseline
model on QALD-4. CACAO achieved a better F-measure than CACAO_P65
mainly because it could overcome the problem of vocabulary mismatch on
Query 29 by annotating the keyword “Australian” with dbpedia:Australia, and
on Query 49 by annotating the keyword “Swedish” with dbpedia:Sweden. As
expected, methods empowered by disambiguation (the disambiguated Levenshtein
and Jaccard variants) score better than their bag-of-words counterparts
(Levenshtein_a and Jaccard_a). The disambiguated Levenshtein variant scores
better than the disambiguated Jaccard variant, confirming a previous research
conclusion [40]. However, both disambiguated variants have their major
drawbacks at the path disambiguation level. When retrieval scoring functions
consider keywords equally weighted, they cannot disambiguate among resources
containing the same keywords. For instance, when a user queries “places”, both
the property dbo:place and the entity type dbo:Place can be equally weighted,
leading these models to retrieve both places and the entities connected to the
property dbo:place. Not surprisingly, there was an issue related to the local⁸
term frequency in the BM25F [2] model. On Query 30, it retrieves the entity
dbpedia:Halloween_(Dave_Matthews_Band_song) because the word “halloween” occurs
more frequently in it than in the desired one (dbpedia:Halloween).

8 Not to be confused with global term frequency.

Table 5 shows the EL evaluation over ten queries. There, CACAO_P65 achieved
the highest F-measure of 1. CACAO achieved an F-measure of 0.90, which is 0.10
higher than MAG, the third best-performing approach. CACAO wrongly annotates
the keyword “bach” in Query 21 with dbpedia:Bachs. CACAO_P65 applied the 65
rule only to the properties, correctly assigning dbpedia:Bach. MAG could not
correctly annotate Queries 34 and 44, and DBpedia Spotlight could not correctly
annotate Queries 12, 41, and 42. The results expose a deficiency of EL systems
in dealing with single entity factual queries.

Entity Linking and Disambiguation approaches [7,24] exploit IR for finding
the corresponding entity. For these systems, incomplete labels can lead to
a missing or an inconsistent annotation. For example, in our evaluation
DBpedia Spotlight links the keyword “baldwin” in Query 47 with the entity
dbpedia:Baldwin_Locomotive_Works. Other queries do not generate any
annotation. That is the case of Query 36, which DBpedia Spotlight does not
annotate using confidence score 0.5, but annotates wrongly using confidence
0.45.⁹ The use of the 65 rule enhanced the results achieved by CACAO when
applied to subjects, properties, and objects in comparison to when applied
only to properties (CACAO_P65), see Table 4. This happens because it can help
to annotate noun resources that are not handled by the lemmatization, e.g.,
Sweden and Swedish in Query 43. However, the use of this method decreases the
precision of the approach in the Entity Linking task (see Table 5) because the
65 rule increases the number of possibly overlapping resources, leading to
wrong annotations. That is the case of Query 21.


Complexity Analysis. In general, entity (document) retrieval algorithms can
be implemented as entity-at-a-time or term-at-a-time. Entity-at-a-time
retrieval algorithms aggregate scores over entities, whereas term-at-a-time
algorithms aggregate over terms. Term-at-a-time is the most common retrieval
method and relies on posting lists implemented in popular IR frameworks such as
Lucene. Intuitively, the complexity of term-at-a-time methods is bounded by the
size $M'$ of the posting lists of the matching terms and by the $E'$ insertions
of matching entities into a tree of size $k$ (top-k), which leads to a
complexity of $O(M' + E' \log k)$.¹⁰ Algorithm 1 displays a naive
implementation of our proposed entity-at-a-time method. The second for loop
(line 21) is bounded by the number of the entity’s matched terms, giving an
overall collection complexity of $O(M')$. However, the first loop (line 16)
requires calculating the activation value on every matched term of the entity,
adding an extra complexity of the matched term frequencies $Tf'$. Thus, the
complexity of Algorithm 1 is at least $\Omega(Tf' + M' + E' \log k)$,
highlighting a point for future improvement.
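
The term-at-a-time bound can be illustrated with a short Python sketch. It is a generic illustration rather than how Lucene or CACAO is implemented: the posting lists are a hypothetical in-memory dictionary, and the top-k selection uses `heapq.nlargest`, which keeps a heap of size k instead of a tree.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Postings = Dict[str, List[Tuple[str, float]]]  # term -> [(entity, weight)]


def term_at_a_time(query_terms: Iterable[str], postings: Postings,
                   k: int) -> List[Tuple[str, float]]:
    """Score entities one query term at a time, then keep the k best."""
    scores: Dict[str, float] = defaultdict(float)
    # Traversing the matching posting lists costs O(M').
    for term in query_terms:
        for entity, weight in postings.get(term, []):
            scores[entity] += weight
    # Selecting k of the E' matching entities costs O(E' log k).
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical posting lists with unit weights.
postings: Postings = {
    "carrot": [("dbr:Carrot", 1.0), ("dbr:Carrot_Cake", 1.0)],
    "cake": [("dbr:Cake", 1.0), ("dbr:Carrot_Cake", 1.0)],
}
top = term_at_a_time(["carrot", "cake"], postings, 1)
```

An entity-at-a-time method such as Algorithm 1 instead visits every matched term of each entity before ranking, which is where the additional $Tf'$ term in the bound comes from.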


5 Related Work 


IR. Existing IR approaches commonly aim to retrieve the top-K ranked
documents for a given NL input query. Term Frequency-Inverse Document Frequency
(TF-IDF) [30] evaluates query keywords based on their local and global
frequency. BM25 [28] extends TF-IDF by introducing a document length
normalization. Field-based extensions of bag-of-words models have been proposed
for IR on structured data. BM25F [5] is an extension of BM25 to retrieve
structured data using differently weighted fields. Mixture of Language Models
(MLM) [25] extends the Language Model (LM) [27] using a linear combination of
query keyword probabilities in a multi-field language model. Although
individual field weights in BM25F and MLM can be tuned for a particular
collection, they are fixed across different query keywords. The Probabilistic
Retrieval Model for Semistructured Data (PRMS) [16] overcomes this limitation
using a probabilistic classification to map query keywords into fields. Other
IR approaches extend field retrieval models by adding keyword dependencies. The
Markov Random Field (MRF) retrieval model [23] proposes three variants of
keyword query dependencies: (1) full independence (FIM); (2) sequential
dependence (SDM); and (3) full dependence (FDM). Zhiltsov et al. [41] proposed
a fielded ER model based on unigrams and bigrams applied to five different
fields (names, categories, similar entity names, related entity names, and
other attributes). The model uses different field weights for ordered (e.g.,
keywords that appear consecutively) and unordered bigrams. Koumenides
et al. Hasibi et al. [14] show that entity linking can improve entity retrieval
models. Asi et al. [17] give a comprehensive overview of ER approaches.

9 With confidence 0.45 “pope” is annotated with the entity dbpedia:Pope.
10 We ignored the existence of fields and resources for simplification.


Semantic Web. Swoogle [9] introduces a modified version of PageRank that
takes into consideration the types of the links between ontologies. Semplore
[39], Falcons [4], and Sindice [8] explore traditional document retrieval for
querying RDF data. YAHOO! BNC and Umass [13] were respectively the best and
second best ER systems in SemanticSearch’10. YAHOO! BNC uses BM25F, applying
specific boosts to different fields (title, name, dbo:title, others). Blanco et
al. [2] use BM25F, boosting important and unimportant fields differently. The
proposed adaptation is implemented in the Glimmer_Y! engine and is shown to
outperform other state-of-the-art methods on the task of ER. Virgilio et al.
[37] introduced a distributed technique for ER on RDF data using MapReduce. The
retrieval is carried out using only the high ranked (Linear) and all matched
fields (Monotonic) strategies. Our work differs from the previous ones by
(1) computing the similarity on the individual resources and (2) avoiding the
over- and under-estimation of frequent and rare keywords.


6 Conclusion, Limitations and Future Work. 


While recent ER systems have gained precision, retrieving the desired
information still poses a major challenge. This work presented a conditional
activation approach for efficient ER over RDF KGs using factual query
interpretation. The results show a significant improvement in accuracy in
comparison to the state-of-the-art ER and EL systems on standard benchmark data
sets. In particular, CACAO shows an increase of ~10% on P@10 and MAP on a
standard ER benchmark data set. CACAO could outperform other ER and EL methods
because it relies on a model that combines two properties: (1) it is a
resource-based rather than a fielded retrieval approach, and (2) it performs a
conditional activation that avoids the over- and under-estimation of frequent
and rare keywords.

Nevertheless, there are a few challenges not addressed in the current
implementation, such as keyword and character position as well as memory and
runtime optimizations of the approach. Queries such as “peace and war”
and “war and peace” can be activated equally. However, one can refer to
dbpedia:Peace_and_War whereas the other refers to dbpedia:War_and_Peace. Recent
works [41] have shown promising results in addressing this problem. The
evaluation shows that current benchmarks do not address this issue. In future
work, we plan to overcome the mentioned challenges. We see this work as the
first step of a broader research agenda for designing more accurate ER systems
over Linked
Data. 
Abstract. This paper describes an approach to Named Entity Recog-
nition (NER) in German language documents from the legal domain.
For this purpose, a dataset consisting of German court decisions was 
developed. The source texts were manually annotated with 19 seman- 
tic classes: person, judge, lawyer, country, city, street, landscape, orga- 
nization, company, institution, court, brand, law, ordinance, European 
legal norm, regulation, contract, court decision, and legal literature. The 
dataset consists of approx. 67,000 sentences and contains 54,000 anno- 
tated entities. The 19 fine-grained classes were automatically generalised 
to seven more coarse-grained classes (person, location, organization, 
legal norm, case-by-case regulation, court decision, and legal literature). 
Thus, the dataset includes two annotation variants, i.e., coarse- and 
fine-grained. For the task of NER, Conditional Random Fields (CRFs)
and bidirectional Long Short-Term Memory networks (BiLSTMs) were
applied to the dataset as state-of-the-art models. Three different models
were developed for each of these two model families and tested with the
coarse- and fine-grained annotations. The BiLSTM models achieve the
best performance with an F1 score of 95.46 for the fine-grained classes and
95.95 for the coarse-grained ones. The CRF models reach a maximum of
93.23 for the fine-grained classes and 93.22 for the coarse-grained ones.
The work presented in this paper was carried out under the umbrella 
of the European project LYNX that develops a semantic platform that 
enables the development of various document processing and analysis 
applications for the legal domain. 


Keywords: Language technology - LT - Natural Language 
Processing - NLP - Named Entity Recognition - NER - Legal 
processing - Curation technologies - Legal technologies - BiLSTM - 
CRF 


1 Introduction 


Named Entity Recognition (NER) is the automatic identification of named enti- 
ties (NEs) in texts, typically including their assignment to a set of semantic cate- 
gories [19]. The established classes (for newspaper texts) are person PER, location 
LOC, organization ORG and other OTH [3,36,37]. Research on NER has a history 
of more than 20 years and has produced approaches based on linear statistical mod-
els, e.g., Maximum Entropy Models [1,10] and Hidden Markov Models [27], among
others. Nowadays, state-of-the-art results are produced by methods such as
CRFs [2,4,16,17] and BiLSTMs [9,20,22,26]. For English news documents, the
best models have a performance of approx. 90 F1 [9,20,22,26,29,38], while the
best models for German are not quite as good with approx. 80 F1 [2,4,16,22].
Based on their very good performance on news documents, we examine the use 
of CRFs and BiLSTMs in legal documents. 


1.1 Application and Project Context 


The objective of the project LYNX (Building the Legal Knowledge Graph for 
Smart Compliance Services in Multilingual Europe), a three-year EU project
that started in December 2017, is the creation of a legal knowledge graph that
contains different types of legal and regulatory data.¹ LYNX aims to help Euro-
pean companies, especially SMEs that already operate internationally, to
offer and promote their products and services in other countries. The project
will eventually offer compliance-related services that are currently tested and 
validated in three use cases. The first pilot is a legal compliance solution, where 
documents related to data protection are innovatively managed, analysed, and 
visualised across different jurisdictions. In the second pilot, LYNX supports the 
understanding of regulatory regimes, including norms and standards, related 
to energy operations. The third pilot is a compliance solution in the domain 
of labour law, where legal provisions, case law, administrative resolutions, and 
expert literature are interlinked, analysed, and compared to define legal strate- 
gies for legal practice. The LYNX services are developed for several European 
languages including English, Spanish and German [32]. 

Documents in the legal domain contain multiple references to NEs, especially 
NEs specific to the legal domain, i.e., jurisdictions, legal institutions, etc. Most 
NER solutions operate in the general or news domain, which makes them not 
completely suitable for the analysis of legal documents, because they are unable 
to detect domain-specific entities. The goal is to make knowledge workers, who
process and make use of these documents, more efficient and more effective in
their day-to-day work; this also includes the analysis of domain-specific NEs (see
[5,31] for related approaches in the area of content curation technologies).


1.2 Research Questions 


This article is dedicated to the recognition of NEs and their respective cate-
gories in German legal documents. Legal language is unique and differs greatly
from newspaper language. This also applies to the use of person, location and
organization NEs, which are relatively rare in legal text. Legal text does,
however, contain specific entities such as designations of legal norms and
references to other legal documents (laws, ordinances, regulations, decisions,
etc.), which play an essential role.


1 http://www.lynx-project.eu.
Despite the development of NER for other languages and domains, the legal
domain has not been exhaustively addressed yet. This research also had to face 
the following two challenges. (1) There is no uniform typology of semantic con- 
cepts related to NEs in documents from the legal domain; correspondingly, uni- 
form annotation guidelines for NEs in the legal domain do not exist either. (2) 
There are no freely available datasets consisting of documents from the legal 
domain, in which NEs have been annotated. 

Thus, the research goal is to examine NER with a specific focus on German 
legal documents. This includes the elaboration of the corresponding concepts,
the construction of a dataset, and the development, evaluation and comparison
of state-of-the-art models for NER. We address the following research questions:


1. Which state of the art approaches are in use for NER? Which approaches have 
been developed for NER in legal documents? Do these approaches correspond 
to the state of the art? 

2. Which NE categories are typical for legal documents? Which classes are to be 
identified and classified? Which legal documents can be used for a dataset? 

3. What performance do current models have? How are different categories rec- 
ognized? Which categories are recognized better than others? 


2 Related Work 


NER in the legal domain, despite its high relevance, is not a well researched area. 
Existing approaches are inconsistent with regard to the applied methods, tech- 
niques, classifications and datasets, which makes it impossible to compare their 
results adequately. Nevertheless, the developed approaches make an important 
contribution and form the basis for further research. 

The first work in which NER in the legal domain was explicitly defined as a 
term was described by Dozier et al. [13]. The authors examined NER in US case 
law, depositions, pleadings and other legal documents, implemented using simple 
lookups in a list of NEs, contextual rules, and statistical models. Taggers were 
developed for jurisdiction, court, title, document type (e.g., brief, memorandum), 
and judge. The jurisdiction tagger performed best with an F1 of 92. The scores
of the other taggers were around 82-85.

Cardellino et al. developed a tool for recognizing, classifying, and linking legal 
NEs [8]. It uses the YAGO and LKIF ontologies and elaborated four different 
levels of granularity: NER, NERC, LKIF and YAGO. A Support Vector Machine, 
Stanford NER [17] and a neural network (NN) were trained and evaluated on 
Wikipedia and decisions of the European Court of Human Rights. The best 
result on the Wikipedia dataset was achieved by the NN with F1 scores for
the NERC and YAGO classes of 86 and 69, respectively. For the LKIF classes,
Stanford NER was better with an F1 score of 77. The performance was significantly
worse on decisions. The F1 scores varied according to the model and the level of
granularity. Stanford NER was able to achieve a maximum F1 score of 56 with
the NERC classes. 
Glaser et al. tested three NER systems [18]. The first, GermaNER [4], recog-
nized person, location, organization and other. Temporal and numerical expres-
sions were recognized using rule-based approaches, and references using the app- 
roach described in Landthaler et al. [23]. The second system was DBpedia Spot- 
light [11,28], developed for the automatic annotation of DBpedia entities. The 
third system, Templated, was designed by Glaser et al. [18]. It focused on NER 
in contracts created using templates. For GermaNER and DBpedia Spotlight, a
manually annotated corpus was created, which consisted of 500 decisions of the
8th Civil Senate of the German Federal Court of Justice relating to
tenancy law. GermaNER and DBpedia Spotlight were evaluated on 20 decisions
from the created dataset and Templated was evaluated on five different contracts.
GermaNER and DBpedia Spotlight achieved an F1 of 80 and 87, respectively.
The result of Templated was 92 F1.

To adapt categories for the legal domain, the set of NE classes was redefined 
in the approaches described above. Thus, Dozier et al. [13] focused on legal NEs 
(e.g., judge, lawyer, court). Cardellino et al. [8] extended NEs on NERC level 
to document, abstraction, and act. It is unclear what belongs to these classes 
and how they were separated from each other. Glaser et al. [18] added reference 
[23]. However, this was understood as a reference to legal norms, so that further 
references (to decisions, regulations, legal literature, etc.) were not covered. 

The research of NER in legal documents is also complicated by the fact 
that there are no freely available datasets, neither for English nor for German. 
Datasets for newspaper texts, such as those developed for CoNLL 2003 or GermEval
2014, are in turn not suitable in terms of the type of text and the annotated
entities. In this context, the need for a manually annotated dataset consisting of legal
texts is enormous, requiring the development of a classification of legal categories 
and uniform annotation guidelines. Such a dataset consisting of documents from 
the legal domain would make it possible to implement NER with state of the 
art architectures, i.e., CRF and BiLSTM, and to analyze their performance. 


3 A Dataset of Documents from the Legal Domain 


3.1 Semantic Categories 


Legal documents differ from texts in other domains, and from each other, in terms
of text-internal and text-external criteria [7,12,15,21], which has a huge impact
on linguistic and thematic design, citation, structure, etc. This also applies to
NEs used in legal documents. In law texts and administrative regulations, the 
occurrence of typical NEs such as person, location and organization is very low. 
Court decisions, on the other hand, include these NEs, and references to national 
or supranational laws, other decisions, and regulations. Two requirements for a 
typology of legal NEs emerge from these peculiarities. First, the categories used 
must reflect those entities that are typical for decisions. Second, a typology must 
concern the entities whose differentiation in decisions is highly relevant. 
Domain-specific NEs in legal documents can be divided into two basic groups,
namely designations and references. For legal norms (i.e., for laws and ordi- 
nances) designations are headings for their standard legal texts, which provide 
information on rank and content [6, Rn. 321 ff.]. Headings are uniform and usu- 
ally consist of a long title, short title and abbreviation, e.g., the title of the 
Medicinal Products Act of 12 December 2005 ‘Gesetz über den Verkehr mit 
Arzneimitteln (Arzneimittelgesetz - AMG)’ (Federal Law Gazette I p. 3394). 
The short title ‘Arzneimittelgesetz’ and the abbreviation ‘AMG’ are in brackets. 
The citation of the legal norms is also fixed. There are different citation rules for 
full and short citations [6, Rn. 168 ff.]. The designation and citation of binding 
individual acts such as regulations or contracts is not uniformly defined. 

For our dataset consisting of court decisions, a total of 19 fine-grained classes 
were developed, which are based on seven coarse-grained classes (see Table 1). 
As a starting point, the well-researched newspaper domain was used for the 
elaboration of the typology. The annotation guidelines are based on the ACE 
guidelines [25] and NoSta-D Named-Entity [3]. The core NEs are typical classes 
like PER, LOC, and ORG, which are split into fine-grained classes.² The coarse- and
fine-grained classifications correlate such that, e.g., the coarse-grained class of 
person PER under number 1 in Table 1 contains the fine-grained classes of judge 
RR, lawyer AN and other person PER (plaintiffs, defendants, witnesses, appraisers, 
etc.) under numbers 1 to 3. The location LOC includes the fine-grained classes of 
country LD (countries, states and city-states), city ST (cities, villages and com- 
munities), street STR (streets, squares, avenues, municipalities and attractions) 
and landscape LDS (continents, mountains, lakes, rivers and other geographical 
units). The coarse-grained class organization ORG is divided into public/social, 
state and economic institutions. They form the fine-grained classes of organiza- 
tion ORG, institution INN, and company UN. Designations of the federal, supreme, 
provincial and local courts are summarized in the fine-grained class court GRT. 
Furthermore, brand³ MRK is a separate category.

A fundamental peculiarity of the published decisions is that all personal infor-
mation is anonymised for data privacy reasons. This applies primarily
to person, location and organization. NEs are replaced by letters (1) or dots (2).


(1) ... das Land B. LD 
(2) ... unter der Firma C... AG UN 


In addition to the typical categories, other classes specific to legal documents, 
i.e., court decisions, are also included in the categories. These are the coarse- 
grained classes of legal norm NRM, case-by-case regulation REG, court decision RS 


2 The coarse- and fine-grained classes PER and ORG are different despite their identical
abbreviations.

3 From an onomastical point of view, brand belongs to object NEs which also con- 
tain the coarse-grained class of organization. Despite terminological and typological 
inaccuracy, brand was intentionally categorized as a fine-grained class of organization 
and not as independent coarse-grained class (see Table 1). 
and legal literature LIT. The legal norm and case-by-case regulation include NEs
(3) and references (4), but the court decision and legal literature only references 
(5). Legal norm NRM is subdivided according to legal force into the fine-grained 
classes law GS, ordinance VO and European legal norm EUN. Case-by-case regu- 
lation REG, on the other hand, contains binding individual acts that are below 
each legal standard. These include the fine-grained classes regulation VS (admin- 
istrative regulations, directives, circulars and decrees) and contract VT (public 
service contracts, international treaties, collective agreements, etc.). The last 
two coarse-grained classes, court decision RS and legal literature LIT, do not 
have any fine-grained classes. RS reflects references to decisions, and LIT sum- 
marizes references to legal commentaries, legislative materials, legal textbooks 
and monographs. 


(3) ... ist nach Maßgabe der Gründe mit dem [...] vereinbar.
(4) Mit der Neuregelung in [...]
(5) [...] LIT
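
The generalisation from the 19 fine-grained classes to the seven coarse-grained classes can be sketched as a simple tag mapping. The dictionary below follows the grouping given in Table 1; the helper name `to_coarse` and the way the IOB2 prefix is handled are illustrative assumptions, not part of the published dataset tooling.

```python
# Fine-grained tag -> coarse-grained tag, following the grouping in Table 1.
FINE_TO_COARSE = {
    "PER": "PER", "RR": "PER", "AN": "PER",                    # person
    "LD": "LOC", "ST": "LOC", "STR": "LOC", "LDS": "LOC",      # location
    "ORG": "ORG", "UN": "ORG", "INN": "ORG",                   # organization
    "GRT": "ORG", "MRK": "ORG",
    "GS": "NRM", "VO": "NRM", "EUN": "NRM",                    # legal norm
    "VS": "REG", "VT": "REG",                                  # case-by-case regulation
    "RS": "RS",                                                # court decision
    "LIT": "LIT",                                              # legal literature
}


def to_coarse(tag: str) -> str:
    """Map an IOB2 tag such as 'B-GS' to its coarse-grained form ('B-NRM').

    Hypothetical helper: the 'O' tag is passed through unchanged.
    """
    if tag == "O":
        return tag
    prefix, _, label = tag.partition("-")
    return f"{prefix}-{FINE_TO_COARSE[label]}"


coarse_tags = [to_coarse(t) for t in ["B-GS", "I-GS", "O", "B-GRT"]]
```

Applied to a fine-grained tag sequence, the mapping leaves span boundaries untouched and only replaces the class label, which is why both annotation variants can share the same tokenisation.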


3.2 Dataset Statistics and Distribution of Semantic Categories 


The dataset Legal Entity Recognition (LER) consists of 750 German court deci-
sions published online in the portal ‘Rechtsprechung im Internet’.⁴ The source
texts were extracted from the XML documents and split into sentences and words
by SoMaJo [30]. The annotation was performed manually by one Computational
Linguistics student using WebAnno [14]. As future work, we plan to add
annotations from two to three linguists so that we can report inter-annotator
agreement. The dataset⁵ is freely available for download under the CC-BY 4.0
license⁶ in CoNLL-2002 format. Each line consists of two columns separated by
a space. The first column contains a token and the second a tag in IOB2 format.
The sentence boundary is marked with an empty line.
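
The file layout described above (one token and one IOB2 tag per line, separated by a space, with an empty line between sentences) can be read with a few lines of code. The following is a minimal sketch; the function name `read_conll` and the sample sentences are made up for illustration and are not taken from the dataset.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB2 tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse two-column, space-separated lines into a list of sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:                       # empty line marks a sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit(" ", 1)   # the tag is the last column
        current.append((token, tag))
    if current:                            # input may not end with an empty line
        sentences.append(current)
    return sentences


# Invented sample in the described format (not actual dataset content).
sample = """Das O
Gericht O
entschied O
. O

§ B-GS
7 I-GS
VersAnstG I-GS
"""
sentences = read_conll(sample.splitlines())
```

For a real file, the same function can be called on an open file handle, e.g. `read_conll(open(path, encoding="utf-8"))`, where `path` is wherever the downloaded file is stored.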

The dataset consists of 66,723 sentences and 2,157,048 tokens. The percentage 
of annotations (per-token basis) is approx. 19%. Overall, the dataset includes 
53,632 annotated NEs. The dataset has two variants for the classification of 
legal NEs (Table 1). The classes person, location and organization make up 25.66% of all
annotated instances. The remaining 74.34% are domain-specific categories, i.e., legal norm NRM, case-
by-case regulation REG, court decision RS and legal literature LIT. The largest
classes are the law GS (34.53%) and court decision RS (23.46%). Other entities, 
i.e., ordinance, European legal norm, regulation, contract, and legal literature, are 
less common (between 1 and 6% of all annotations). 
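
The shares quoted above can be re-derived from the coarse-grained counts in Table 1. The short check below is only an arithmetic illustration; the variable names are arbitrary.

```python
# Number of annotated NEs per coarse-grained class, as listed in Table 1.
counts = {
    "PER": 3377, "LOC": 2468, "ORG": 7915,
    "NRM": 20816, "REG": 3470, "RS": 12580, "LIT": 3006,
}

total = sum(counts.values())
general = counts["PER"] + counts["LOC"] + counts["ORG"]  # person, location, organization
legal = total - general                                  # domain-specific categories


def share(n: int) -> float:
    """Percentage of all annotated NEs, rounded to two decimals."""
    return round(100 * n / total, 2)


general_share = share(general)
legal_share = share(legal)
```

With these counts, `total` is 53,632, and the two shares come out as 25.66 and 74.34, matching the figures in the text.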


4 http://www.rechtsprechung-im-internet.de.
5 https://github.com/elenanereiss/Legal-Entity-Recognition.
6 https://creativecommons.org/licenses/by/4.0/deed.en.
Table 1. Distribution of coarse- and fine-grained classes in the dataset


No. | Tag | Coarse-grained class    | #      | %     | No. | Tag | Fine-grained class  | #      | %
1   | PER | Person                  |  3,377 |  6.30 | 1   | PER | Person              |  1,747 |  3.26
    |     |                         |        |       | 2   | RR  | Judge               |  1,519 |  2.83
    |     |                         |        |       | 3   | AN  | Lawyer              |    111 |  0.21
2   | LOC | Location                |  2,468 |  4.60 | 4   | LD  | Country             |  1,429 |  2.66
    |     |                         |        |       | 5   | ST  | City                |    705 |  1.31
    |     |                         |        |       | 6   | STR | Street              |    136 |  0.25
    |     |                         |        |       | 7   | LDS | Landscape           |    198 |  0.37
3   | ORG | Organization            |  7,915 | 14.76 | 8   | ORG | Organization        |  1,166 |  2.17
    |     |                         |        |       | 9   | UN  | Company             |  1,058 |  1.97
    |     |                         |        |       | 10  | INN | Institution         |  2,196 |  4.09
    |     |                         |        |       | 11  | GRT | Court               |  3,212 |  5.99
    |     |                         |        |       | 12  | MRK | Brand               |    283 |  0.53
4   | NRM | Legal norm              | 20,816 | 38.81 | 13  | GS  | Law                 | 18,520 | 34.53
    |     |                         |        |       | 14  | VO  | Ordinance           |    797 |  1.49
    |     |                         |        |       | 15  | EUN | European legal norm |  1,499 |  2.79
5   | REG | Case-by-case regulation |  3,470 |  6.47 | 16  | VS  | Regulation          |    607 |  1.13
    |     |                         |        |       | 17  | VT  | Contract            |  2,863 |  5.34
6   | RS  | Court decision          | 12,580 | 23.46 | 18  | RS  | Court decision      | 12,580 | 23.46
7   | LIT | Legal literature        |  3,006 |  5.60 | 19  | LIT | Legal literature    |  3,006 |  5.60
    |     | Total                   | 53,632 | 100   |     |     | Total               | 53,632 | 100


4 Evaluation and Results 


We used two tools for sequence labeling for our experiments: sklearn-crfsuite⁷
and UKPLab-BiLSTM [35]. In total, 12 models were tested, i.e., three CRF and
three BiLSTM models, each with coarse- and fine-grained classes. For CRFs, the
following groups of features and sources were selected and manually developed:


1. F: features for the current word in a context window between -2 and +2,
which are case and shape features, prefixes, and suffixes;

2. G: for the current word, gazetteers of persons from Benikova et al. [4];
gazetteers of countries, cities, streets, landscapes, and companies from
GOVDATA⁸, the Federal Agency for Cartography and Geodesy⁹ and
Datendieter.de¹⁰; gazetteers of laws, ordinances and administrative regulations
from the Federal Ministry of Justice and Consumer Protection¹¹,¹². A detailed
description of the gazetteers can be found in the GitHub project;

3. L: lookup table for the word similarity in a context window between -2 and
+2 as in Benikova et al. [4], which contains the four most similar words to
the current word.
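
To make feature group F more concrete, the following sketch builds a feature dictionary for one token from case, shape, prefix and suffix information in a window of -2 to +2. It is an illustration only: the exact feature names, the affix lengths (1 to 3 characters) and the shape encoding are assumptions, since the text does not specify them, and the gazetteer (G) and lookup (L) groups are omitted.

```python
import re
from typing import Dict, List


def word_shape(word: str) -> str:
    """Collapse a word into a coarse shape, e.g. 'VersAnstG' -> 'XxXxX'."""
    shape = re.sub(r"[A-ZÄÖÜ]", "X", word)
    shape = re.sub(r"[a-zäöüß]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # squeeze runs of the same symbol


def token_features(sentence: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Features for token i, covering positions i-window .. i+window."""
    features: Dict[str, object] = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        key = f"{offset:+d}"                 # '-2', '-1', '+0', '+1', '+2'
        if j < 0:
            features[f"{key}:BOS"] = True    # position before sentence start
            continue
        if j >= len(sentence):
            features[f"{key}:EOS"] = True    # position after sentence end
            continue
        word = sentence[j]
        features[f"{key}:lower"] = word.lower()
        features[f"{key}:istitle"] = word.istitle()
        features[f"{key}:isupper"] = word.isupper()
        features[f"{key}:isdigit"] = word.isdigit()
        features[f"{key}:shape"] = word_shape(word)
        for n in (1, 2, 3):                  # assumed affix lengths
            features[f"{key}:prefix{n}"] = word[:n]
            features[f"{key}:suffix{n}"] = word[-n:]
    return features


feats = token_features(["§", "7", "VersAnstG"], 2)
```

A sentence is then represented as one such dictionary per token, which is the input shape that CRF toolkits such as sklearn-crfsuite accept.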


7 https://sklearn-crfsuite.readthedocs.io.
8 https://www.govdata.de/apps/-/details/liste-der-staatennamen.
9 https://www.bkg.bund.de/DE/Produkte-und-Services/Shop-und-Downloads/Digitale-Geodaten/Geographische-Namen/geographische-namen.html.
10 https://www.datendieter.de.
11 https://www.gesetze-im-internet.de.
12 http://www.verwaltungsvorschriften-im-internet.de.
Three models were designed to combine these groups of features and gazetteers:
(1) CRF-F with features; (2) CRF-FG with features and gazetteers; and (3)
CRF-FGL with features, gazetteers, and the lookup table; the model names
reflect the three groups. As a learning algorithm, the L-BFGS method is used
with the L1 and L2 regularization coefficients both set to 0.1. The maxi-
mum number of iterations for optimizing the algorithm is set to 100.
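
The training settings stated above map directly onto the constructor arguments of sklearn-crfsuite. The snippet below is a sketch of that configuration only; the helper `build_crf` is a hypothetical name, and any option not mentioned in the text is left at the library default.

```python
# Hyperparameters as stated in the text; the keyword names are those of
# sklearn_crfsuite.CRF.
CRF_PARAMS = {
    "algorithm": "lbfgs",    # L-BFGS optimisation
    "c1": 0.1,               # L1 regularization coefficient
    "c2": 0.1,               # L2 regularization coefficient
    "max_iterations": 100,   # upper bound on optimiser iterations
}


def build_crf():
    """Create an untrained CRF; requires the sklearn-crfsuite package."""
    import sklearn_crfsuite  # imported lazily so the config can be inspected without it
    return sklearn_crfsuite.CRF(**CRF_PARAMS)
```

Training would then follow the usual pattern `build_crf().fit(X_train, y_train)`, where `X_train` is a list of sentences, each a list of per-token feature dictionaries, and `y_train` holds the corresponding IOB2 tag sequences.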

For BiLSTM we also use three models: (1) BiLSTM-CRF [20]; (2) BiLSTM- 
CRF+ with character embeddings from the BiLSTM [22]; (3) BiLSTM-CNN- 
CRF with character embeddings from CNN [26]. As hyperparameters we used 
the values that achieved the best NER performance according to Reimers and 
Gurevych [34]. The BiLSTM models have two BiLSTM layers, each with a size 
of 100 units and a dropout of 0.25. The maximum number of epochs is 100. In
addition, the tool uses pre-trained word embeddings for German [33].

The results were measured with micro-averaged precision, recall and F1.
In order to reliably estimate their performance, we evaluated the models using
stratified 10-fold cross-validation. The dataset is shuffled sentence-wise and
divided into ten mutually exclusive subsets of similar size. One iteration
uses one subset for validation and the rest for training. We iterate ten times, so that
each part of the dataset is used nine times for training and once for validation.
The distribution of NEs in the training and validation sets remains the same over
the iterations. Cross-validation guards against performance estimates that
depend on a single split, and stratification prevents measurement errors on
unbalanced data.
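
Micro-averaged scores pool the counts of all classes before precision, recall and F1 = 2PR / (P + R) are computed. The sketch below shows one way to do this on IOB2 sequences. It assumes exact-match evaluation at the entity (span) level, which the text does not state explicitly, and all function names are illustrative.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, int, str]  # (sentence index, start, end, label)


def iob2_spans(tagged: List[List[str]]) -> Set[Span]:
    """Collect entity spans from IOB2 tag sequences (end index is exclusive)."""
    spans: Set[Span] = set()
    for s, tags in enumerate(tagged):
        start, label = None, None
        for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
            continues = tag.startswith("I-") and label == tag[2:]
            if start is not None and not continues:
                spans.add((s, start, i, label))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def micro_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over all classes."""
    g, p = iob2_spans(gold), iob2_spans(pred)
    tp = len(g & p)                             # spans with identical boundaries and label
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: one of two predicted spans has a wrong right boundary.
gold = [["B-GS", "I-GS", "I-GS", "O"], ["B-RS", "O"]]
pred = [["B-GS", "I-GS", "O", "O"], ["B-RS", "O"]]
scores = micro_prf(gold, pred)
```

In a cross-validation setting, such scores would be computed on the validation subset of each of the ten folds and then averaged; the fold construction itself is not shown here.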


4.1 CRF Models 


For the fine-grained classes, CRF-FGL achieved the best performance with an
F1 score of 93.23 (Table 2). The recognition of legal NEs in the different classes
had varied levels of success depending on the model. Lawyer, institution, court,
contract and court decision reached the highest F1 with CRF-F. With CRF-FG,
better results could be achieved for judge, city, regulation and legal literature.
This means that the gazetteers had a positive impact on the recognition of
these NEs. The remaining classes performed better with CRF-FGL. The combi-
nation of gazetteers and the lookup table for word similarity improved
the results, but not as much as expected.

For the coarse-grained classes, CRF-FG and CRF-FGL together achieved
the best result with an F1 value of 93.22 (Table 3). However, person was recog-
nized better with CRF-FG, and location and organization better with CRF-FGL.
CRF-FG achieved the best result for case-by-case regulation and court deci-
sion. With CRF-FGL, the values for legal norm and legal literature increased.
Compared to the fine-grained classes, precision and recall were better balanced
and F1 increased by at most 0.1 per model.


4.2 BiLSTM Models 


For the fine-grained classes, the two models with character embeddings achieved
the best result with an F1 score of 95.46 (Table 4), confirming the positive impact
Table 2. Precision, recall and F1 values of CRF models for fine-grained classes


Fine-grained classes | CRF-F                 | CRF-FG                | CRF-FGL
                     | Prec  | Rec   | F1    | Prec  | Rec   | F1    | Prec  | Rec   | F1
Person               | 89.41 | 83.53 | 86.32 | 90.50 | 83.54 | 86.83 | 90.44 | 84.22 | 87.18
Judge                | 98.22 | 97.62 | 97.92 | 98.68 | 97.75 | 98.21 | 98.55 | 97.75 | 98.14
Lawyer               | 93.14 | 76.84 | 83.73 | 89.81 | 73.51 | 80.39 | 92.17 | 75.04 | 81.99
Country              | 96.73 | 90.42 | 93.44 | 97.03 | 91.98 | 94.40 | 96.93 | 92.62 | 94.70
City                 | 88.99 | 77.37 | 82.70 | 88.27 | 81.77 | 84.77 | 88.09 | 81.82 | 84.67
Street               | 88.69 | 59.58 | 70.51 | 87.51 | 57.95 | 68.90 | 90.50 | 59.85 | 71.30
Landscape            | 94.34 | 61.14 | 73.43 | 92.63 | 64.09 | 75.25 | 93.33 | 65.27 | 76.08
Organization         | 86.82 | 71.25 | 78.20 | 86.71 | 71.95 | 78.56 | 88.84 | 72.72 | 79.89
Company              | 92.77 | 86.04 | 89.21 | 93.00 | 86.18 | 89.39 | 93.54 | 86.85 | 90.01
Institution          | 92.74 | 89.49 | 91.07 | 92.88 | 89.20 | 90.98 | 92.51 | 89.47 | 90.96
Court                | 97.23 | 96.35 | 96.78 | 97.03 | 96.35 | 96.69 | 97.19 | 96.33 | 96.75
Brand                | 85.85 | 56.91 | 67.85 | 90.33 | 56.20 | 68.82 | 88.40 | 58.07 | 69.61
Law                  | 96.86 | 96.34 | 96.60 | 97.00 | 96.44 | 96.72 | 97.02 | 96.56 | 96.79
Ordinance            | 91.91 | 82.23 | 86.79 | 91.35 | 82.85 | 86.87 | 91.41 | 83.49 | 87.26
European legal norm  | 89.37 | 86.07 | 87.67 | 88.91 | 85.49 | 87.14 | 89.41 | 86.21 | 87.76
Regulation           | 83.83 | 71.38 | 77.00 | 84.34 | 71.03 | 77.02 | 84.42 | 70.66 | 76.85
Contract             | 90.66 | 87.72 | 89.15 | 90.18 | 87.42 | 88.76 | 90.53 | 87.67 | 89.06
Court decision       | 93.35 | 93.39 | 93.37 | 93.22 | 93.34 | 93.28 | 93.21 | 93.29 | 93.25
Legal literature     | 92.98 | 91.28 | 92.12 | 92.94 | 91.42 | 92.17 | 92.79 | 91.28 | 92.02
Total                | 94.28 | 91.85 | 93.05 | 94.31 | 91.96 | 93.12 | 94.37 | 92.12 | 93.23


Table 3. Precision, recall and F1 values of CRF models for coarse-grained classes


Coarse-grained classes  | CRF-F                 | CRF-FG                | CRF-FGL
                        | Prec  | Rec   | F1    | Prec  | Rec   | F1    | Prec  | Rec   | F1
Person                  | 94.20 | 89.43 | 91.74 | 94.54 | 89.99 | 92.20 | 94.22 | 90.20 | 92.16
Location                | 94.60 | 84.55 | 89.26 | 93.89 | 85.48 | 89.45 | 94.33 | 86.45 | 90.18
Organization            | 92.82 | 89.00 | 90.87 | 93.02 | 89.08 | 90.99 | 93.23 | 89.10 | 91.11
Legal norm              | 96.19 | 95.16 | 95.67 | 96.29 | 95.26 | 95.77 | 96.28 | 95.44 | 95.86
Case-by-case regulation | 89.29 | 84.72 | 86.94 | 89.28 | 84.77 | 86.96 | 88.76 | 84.15 | 86.39
Court decision          | 93.19 | 93.26 | 93.23 | 93.28 | 93.23 | 93.25 | 93.08 | 93.08 | 93.08
Legal literature        | 92.72 | 91.15 | 91.92 | 92.99 | 91.14 | 92.06 | 93.11 | 91.13 | 92.11
Total                   | 94.17 | 92.07 | 93.11 | 94.26 | 92.20 | 93.22 | 94.22 | 92.25 | 93.22


of character-level information. A significant improvement, with an increase in
F1 of 5-16 points (compared to the BiLSTM-CRF without character embeddings),
was found in organization, company, ordinance, regulation and contract. Judge
and lawyer were recognized better by about 1 point with the BiLSTM-CRF. Person,
country, city, court, brand, law, ordinance, European legal norm, regulation and


Fine-Grained Named Entity Recognition in Legal Documents 281 


Table 4. Precision, recall and F1 values of BiLSTM models for fine-grained classes


Fine-grained classes | BiLSTM-CRF            | BiLSTM-CRF+           | BiLSTM-CNN-CRF
                     | Prec  | Rec   | F1    | Prec  | Rec   | F1    | Prec  | Rec   | F1
Person               | 89.30 | 91.08 | 90.09 | 90.78 | 92.24 | 91.45 | 90.21 | 92.57 | 91.35
Judge                | 98.64 | 99.48 | 99.05 | 98.37 | 99.21 | 98.78 | 98.18 | 99.01 | 98.59
Lawyer               | 94.85 | 84.62 | 88.19 | 86.18 | 90.59 | 87.07 | 88.02 | 87.96 | 87.11
Country              | 94.66 | 95.98 | 95.29 | 96.52 | 96.81 | 96.66 | 95.09 | 97.20 | 96.12
City                 | 81.26 | 86.32 | 83.48 | 82.58 | 89.06 | 85.60 | 83.21 | 87.95 | 85.38
Street               | 81.70 | 75.94 | 78.10 | 81.82 | 75.78 | 77.91 | 86.24 | 78.21 | 81.49
Landscape            | 78.54 | 79.08 | 77.57 | 78.50 | 80.20 | 78.25 | 80.93 | 81.80 | 80.90
Organization         | 79.50 | 74.72 | 76.89 | 82.70 | 80.18 | 81.28 | 84.32 | 81.00 | 82.51
Company              | 85.81 | 81.34 | 83.44 | 90.05 | 88.11 | 89.04 | 91.72 | 89.18 | 90.39
Institution          | 88.88 | 90.91 | 89.85 | 89.99 | 92.40 | 91.17 | 90.24 | 92.23 | 91.20
Court                | 97.49 | 98.33 | 97.90 | 97.72 | 98.24 | 97.98 | 97.52 | 98.34 | 97.92
Brand                | 78.34 | 73.11 | 75.17 | 83.04 | 76.25 | 79.17 | 83.48 | 73.62 | 77.79
Law                  | 96.59 | 97.01 | 96.80 | 98.34 | 98.51 | 98.42 | 98.44 | 98.38 | 98.41
Ordinance            | 82.63 | 72.61 | 77.08 | 92.29 | 92.96 | 92.58 | 91.00 | 91.09 | 90.98
European legal norm  | 90.62 | 89.79 | 90.18 | 92.16 | 92.63 | 92.37 | 91.58 | 92.29 | 91.92
Regulation           | 75.58 | 68.91 | 71.77 | 85.14 | 78.87 | 81.63 | 79.43 | 78.30 | 78.74
Contract             | 87.12 | 85.86 | 86.48 | 92.00 | 92.64 | 92.31 | 90.78 | 92.06 | 91.40
Court decision       | 96.34 | 96.47 | 96.41 | 96.70 | 96.73 | 96.71 | 97.04 | 97.06 | 97.05
Legal literature     | 93.87 | 93.68 | 93.77 | 94.34 | 93.94 | 94.14 | 94.25 | 94.22 | 94.23
Total                | 93.80 | 93.70 | 93.75 | 95.36 | 95.57 | 95.46 | 95.34 | 95.58 | 95.46


contract were identified better with the BiLSTM-CRF+, and street, landscape,
organization, company, institution, court decision and legal literature with the
BiLSTM-CNN-CRF. The results also depended on whether the character embeddings
were produced by the BiLSTM or the CNN. Brand, ordinance and regulation
benefited significantly from the use of the BiLSTM. However, recognition of
street and landscape improved with the character embeddings from the CNN.

For the coarse-grained classes, F1 increased by 0.3-0.9 per model, and preci-
sion and recall were also more balanced (Table 5). The best result was produced
by the BiLSTM-CRF+ with 95.95. This model had the highest values, more
than 90 F1, in almost all classes. An exception was organization, where the
BiLSTM-CNN-CRF achieved an F1 that was higher by 0.3.
Table 5. Precision, recall and F1 values of BiLSTM models for coarse-grained classes


Coarse-grained classes  | BiLSTM-CRF            | BiLSTM-CRF+           | BiLSTM-CNN-CRF
                        | Prec  | Rec   | F1    | Prec  | Rec   | F1    | Prec  | Rec   | F1
Person                  | 94.34 | 95.16 | 94.74 | 94.82 | 96.03 | 95.41 | 94.09 | 96.21 | 95.12
Location                | 90.85 | 92.59 | 91.68 | 92.60 | 94.05 | 93.31 | 91.74 | 93.45 | 92.57
Organization            | 91.82 | 90.94 | 91.37 | 92.87 | 92.89 | 92.87 | 93.80 | 92.65 | 93.21
Legal norm              | 97.04 | 96.50 | 96.77 | 97.93 | 98.04 | 97.98 | 97.71 | 97.87 | 97.79
Case-by-case regulation | 86.79 | 84.15 | 85.43 | 90.72 | 90.53 | 90.61 | 90.11 | 90.80 | 90.43
Court decision          | 96.54 | 96.58 | 96.56 | 96.93 | 97.05 | 96.99 | 96.73 | 96.83 | 96.78
Legal literature        | 93.78 | 93.91 | 93.84 | 94.23 | 94.62 | 94.42 | 94.24 | 93.80 | 94.02
Total                   | 94.86 | 94.49 | 94.68 | 95.84 | 96.07 | 95.95 | 95.71 | 95.87 | 95.79


4.3 Discussion 


The BiLSTMs achieved superior performance compared to the CRFs. They
produced good results even with the fine-grained classes covered poorly in the
dataset. The CRF models, on the other hand, delivered F1 values that were about
1-10 points lower per class. In addition, some classes are characterized by bigger
differences in precision and recall, indicating certain weaknesses of the CRFs. In par-
ticular, the recognition of street and brand with the BiLSTM models improved
by at least 10 points. The values for lawyer, landscape and ordinance also
increased by about 5 points.

The results also show that the two model families exhibit similar perfor-
mance patterns, which are due to the dataset or the structure of the data. The
models produce their best results, with F1 scores of 95 and higher, in the
fine-grained classes judge, court and law. On the
one hand, this is due to a smaller number of types compared to tokens in judge
and court. On the other hand, the precise identification of law can be explained
by its good coverage in the dataset and uniform citation. Incorrect predictions
about boundaries are made if references have a different form, such as ‘§ 7 des
Gesetzes (gemeint ist das VersAnstG)’ instead of the common ‘§ 7 VersAnstG’, or ‘das
zwölfte Kapitel des neunten Sozialgesetzbuches’ instead of ‘das Kapitel 12 des
SGB XII’. Terms containing the word ‘law’, such as ‘federal law’, ‘law of
experience’ or ‘criminal law’, were also incorrectly classified as NEs. The
recognition of country, institution, court decision, and legal literature was also
very good with scores higher than 90 F1. This is likewise due to a smaller number of
types in country and institution and to the uniform references of court decision
and legal literature.

However, the recognition of street, landscape, organization and regulation
is the lowest throughout, amounting to 69-80 with the CRF and 72-83 with
the BiLSTM models, caused by inconsistent citation styles. The recognition of
street and landscape is poor because they are covered in the dataset with only
about 200 instances, but heterogeneously represented. The worst result, i.e., a
maximum F1 value of 69.61 with the CRFs and of 79.17 with the BiLSTMs,
was observed in brand. These NEs were also expressed in different contexts,
such as the brand NE ‘Einstein’s Garage’ and the scientist Albert Einstein. It
can be concluded that the differences in the recognition of certain NEs are due
firstly to the unbalanced class distribution and secondly to the specifics of the
legal documents, in particular the coverage in the corpus, the heterogeneity
with regard to the form of names or references, and the context.

Overall, the CRFs and BiLSTMs perform very well, producing state-of-the-art
results, which are significantly better than those of comparable models for
newspaper text. This fact can, first, be explained by the size of the dataset, which is larger
than other NE datasets for German. Second, the form of legal NEs, which also 
includes references, differs a lot from NEs in newspaper text. The distribution 
of designations or references in the dataset consisting of documents from the 
legal domain is greater compared to person, location or organization. Third, the 
strictly regulated linguistic and thematic design (repeated use of NEs per one 
decision, repeated use of formulaic, template-like sentences, etc.) and the uni- 
form reference style have had a positive impact on performance. The applied 
evaluation method made it possible to reliably estimate performance for unbal- 
anced data. Unfortunately, it is not possible to compare our results with other 
systems for NER in legal documents because they are not freely available. 


5 Conclusion 


We describe and evaluate a set of approaches for the recognition of semantic
concepts in German court decisions. In line with the goals, the characteristic and
relevant semantic categories such as legal norm, case-by-case regulation, court
decision and legal literature were worked out, a dataset of legal documents
was built, and instances of a total of 19 semantic classes were annotated. For the
experiment, CRF and BiLSTM models that correspond to the state of the art
were selected and tested with the two sets of classes. The results of both model
families demonstrate the superiority of the BiLSTM models with character
embeddings, with an F1 score of 95.46 for the fine-grained classes and 95.95 for
the coarse-grained classes. We found that the structure of the data involved in the training
process strongly impacts the performance. To improve NER, it is necessary to 
extend or optimize the unbalanced data. This helps to minimize the specific 
influencing factors of the legal documents on models. Our results show that there 
is no universal model that recognizes all classes in the best way. Accordingly, an 
even better universal system could be built as an ensemble of different models 
that perform well for particular classes.

Abstract. Knowledge Graph completion deals with the addition of
missing facts to knowledge graphs. While quite a few approaches exist 
for type and link prediction in knowledge graphs, the addition of literal 
values (also called instance or entity attributes) is not very well covered 
in the literature. In this paper, we present an approach for extracting 
numerical and date literal values from Wikipedia abstracts. We show 
that our approach can add 643k additional literal values to DBpedia at 
a precision of about 95%. 


Keywords: Knowledge Graph · Completion · Literals · DBpedia


1 Introduction 


In the past, adding missing facts to knowledge graphs to increase their data
quality has gained a lot of attention [10]. Most prominently,
link prediction using embedding models is a very active field of research [14].
While a lot of research is devoted to missing links, i.e., relations between
two entities, the prediction of missing facts involving literals (e.g., numbers or
dates) is considerably underrepresented in the current research landscape [10].
In this paper, we aim at closing this gap by identifying and extracting literal
values from abstracts in Wikipedia¹, defining an abstract as the text before the
table of contents of a page. In contrast to
standard relation extraction, there are a few additional challenges to face:


— Natural text uses a lot of different number formats (e.g., w.r.t. decimal and 
thousands separators) [15]. Even within a single Wikipedia article, the number 
formats may be inconsistent [11]. 

— Numbers often come with units of measurement, which complicate the extrac- 
tion, since those units need to be harmonized [13]. 

— Exact numbers are often rounded in natural text (e.g., about 3,000 instead of
3,085), which can make it difficult to assess whether a rounded and an exact
number refer to the same or a different fact.


The contribution of this paper is an approach for extracting literal values (num- 
bers and dates) from Wikipedia articles, which can deal with roundings and 
different units of measurement. 


1 We follow the long abstract notion in [9], extracting the “text before a table of
contents” from a Wikipedia page.


2 Related Work


There are not many approaches for completing literal values in knowledge graphs. 
In [12], the use of Web tables as a source for literal values is explored. This setting 
is different, since the authors work on structured, not unstructured data, so they 
can use different features. 

One of the few closer approaches is presented in [2], where the authors run 
open relation extraction on Wikipedia text and perform an a posteriori map- 
ping to the DBpedia ontology. Although they do not deal with numbers, their 
approach can extract date literals and reaches an overall precision of 74.3%².

Similarly, the authors of [1] train specific models for DBpedia relations, but 
only evaluate their approach on two date-valued and one integer-valued relation. 
They extract 440k date-valued literals (birthdate and deathdate) at a precision
of 91%, and 237k integer-valued literals (population) at a precision of 70%³.

In comparison to that state of the art, the approach discussed in this paper 
yields superior results both in absolute numbers of values extracted, as well as 
in precision. 

For error detection in knowledge graphs, there are approaches based on out- 
lier detection [3], probabilistic data modeling [7] and data fusion [8]. There, it can 
also be observed that the amount of research directed towards literal values is 
much underrepresented in comparison to relation assertions between individuals. 


3 Approach 


Our approach builds on previous works for extracting relation assertions from 
Wikipedia abstracts [4,5]. That approach exploits links in Wikipedia abstracts, 
learns characteristic patterns for relations (e.g., The first place linked in the 


[Figure 1: screenshot of the Wikipedia article on Kassel, showing the abstract
with the numbers and dates found in it marked, next to the infobox (state, area,
elevation, population, density, postal codes, and similar fields).]


Fig. 1. Example from Wikipedia with a correct and an incorrect example extracted, 
as well as non-matching literals marked in the abstract. 


2 Their final dataset contains about 5k date literals mapped to properties in the
DBpedia ontology.
3 The totals include both literals contained and not contained in DBpedia.

Wikipedia abstract about a person is that person’s birthplace), and then applies
the models to extract new statements (e.g., new facts for the relation birthplace). 
For each relation, a separate model is trained and validated on the existing 
instances, which allows for only applying models that achieve a desired precision. 


3.1 Training Data Creation 


To create the training data, we use regular expressions to detect and parse
numbers in the abstract in various formats (i.e., with different thousands and
decimal separators), and SpaCy⁴ and dateparser⁵ to detect and parse dates.
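
As a rough illustration of the number detection step, the following sketch parses numbers written with English-style thousands and decimal separators. The pattern and the helper name `extract_numbers` are assumptions made for this example; the regular expressions actually used are not given in the text, and date detection with SpaCy and dateparser is not shown.

```python
import re

# Assumed pattern: digits with optional comma thousands separators and an
# optional decimal part. The lookarounds skip digits glued to letters ("km2").
NUMBER = re.compile(r"(?<!\w)(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?!\w)")


def extract_numbers(text):
    """Return all numbers found in text as floats (hypothetical helper)."""
    return [float(m.group().replace(",", "")) for m in NUMBER.finditer(text)]


print(extract_numbers("Kassel had 200,507 inhabitants in 2015 and covers 106.8 km2."))
```

Other locales (e.g., a decimal comma) would need additional patterns, which is one reason why number formats are listed among the challenges above.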

With these approaches, we extract the sets of numerical literals $N$ and date
literals $D$ from the Wikipedia abstract describing a DBpedia entity $e$. Since
numbers may be rounded, we accept a training example $n \in N$ as a positive
example for a relation $r$ if there is a statement $r(e, v)$ in DBpedia with
$n \in [v \cdot (1-p),\; v \cdot (1+p)]$ for a deviation factor $p$. We manually
examined candidates drawn at deviation factors of 1%, 1.5%, and 2%, and observed
that the precision at 1% and 1.5% was 65%, and dropped to 60% when further
increasing the deviation factor. Hence, we decided to use a factor of 1.5% in
our further experiments.
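
The acceptance rule for rounded numbers can be sketched as follows. The function name `is_positive_example` and the sample values are hypothetical; only the interval and the factor of 1.5% come from the text.

```python
def is_positive_example(n, v, p=0.015):
    """True if the extracted number n lies in [v*(1-p), v*(1+p)]."""
    # sorted() keeps the interval well-formed for negative values of v.
    low, high = sorted((v * (1 - p), v * (1 + p)))
    return low <= n <= high


# Hypothetical values: a number from the abstract against a stored value of 3,085.
print(is_positive_example(3000, 3085))  # deviation of about 2.8%
print(is_positive_example(3050, 3085))  # deviation of about 1.1%
```

With the default factor, the first call falls outside the interval and the second falls inside it.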

Figure 1 illustrates this generation of examples. Since DBpedia is constructed 
from infoboxes in Wikipedia, the values in the infobox on the right hand side 
correspond to the values in DBpedia. Given the Wikipedia abstract, 200,507 
would be extracted as a training example for the relation population (correct), 
while 1928 would be extracted as a training example for the relation density 
(incorrect). The deviation is 0.11% and 1.47%, respectively. 

Since dates are not rounded, training data for date-valued literals are based
on exact matches with DBpedia only⁶.

As negative training examples, we use all numbers or dates, respectively,
that have been tagged in the abstract but are not identified as positive
examples for the relation at hand. In the example depicted in Fig. 1, we would
use all numbers except for 200,507 as negative training examples for the relation
population.


3.2 Unit Conversion 


An initial look at the training data revealed that this approach misses quite a few 
numerical training examples, since the units of measurement in which the facts 
are stored are often different from the ones in the abstracts. For example, areas 
(of countries, cities, ...) are stored in DBpedia in square meters, while they are 
typically written in square kilometers or non-metric units. Therefore, for those
relations, the training data sets created are often very small (e.g., for area, which
is one of the most frequent relations in DBpedia, we initially collected fewer than
100 training examples).


4 https://spacy.io/.
5 https://pypi.org/project/dateparser/.
6 Note that it is not trivial to detect that 1928 in the text is a date, not an integer.


Table 1. Examples for unit conversions learned from the data.


Token | Target unit | Correct factor | Inferred factor | R squared
km² | m² | 1,000,000 | 997,097 | 0.9949
km2 | m² | 1,000,000 | 999,927 | 0.9999
ha | m² | 10,000 | 9,467 | 0.8987
pupils | $ | - | 13,613 | 0.9062
kilometers | m | 1,000 | 973 | 0.9347
century | m | - | 73,453 | 0.9421


Therefore, we decided to enhance the training example generation with unit 
conversion. We follow the assumption that (1) units of measurement are typically
the token after a number⁷, and (2) the function for converting units to their
standard unit in DBpedia is usually a simple multiplication by a factor. Thus,
we group numeric literals for each relation by the token following the number 
(e.g., ha) and try to learn a regression model for that token. From those regres- 
sion models, we derive unit conversion rules which are applied to the literals 
extracted as above before mapping them to relations in DBpedia. Following an 
initial inspection of the data, we accept unit conversions learned on at least 100 
examples and having a coefficient of determination of at least 0.85. Table 1 shows 
a few example unit conversion factors, including useful rules learned, but also 
some misleading rules (e.g., converting the “unit” pupils to $). 
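
A minimal sketch of how such a conversion rule could be learned is given below. It assumes a least-squares fit through the origin and the usual definition of the coefficient of determination; the text only states that a regression model is learned per token, so the exact model, the function name `learn_conversion_factor`, and the synthetic data are assumptions.

```python
def learn_conversion_factor(pairs, min_examples=100, min_r2=0.85):
    """Fit value_in_dbpedia = factor * number_in_text through the origin.

    pairs: (number_in_text, value_in_dbpedia) tuples for one relation and one
    token following the number. Returns the factor, or None if rejected.
    """
    if len(pairs) < min_examples:
        return None  # too few examples to trust the rule
    sxx = sum(x * x for x, _ in pairs)
    if sxx == 0:
        return None
    factor = sum(x * y for x, y in pairs) / sxx
    mean_y = sum(y for _, y in pairs) / len(pairs)
    ss_res = sum((y - factor * x) ** 2 for x, y in pairs)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pairs)
    if ss_tot == 0:
        return None
    r2 = 1 - ss_res / ss_tot
    return factor if r2 >= min_r2 else None


# Synthetic, noise-free data for the token "ha" (hectares to square meters).
pairs = [(x, 10_000 * x) for x in range(1, 121)]
print(learn_conversion_factor(pairs))
```

As Table 1 shows, a high coefficient of determination alone does not guarantee a meaningful rule, so a threshold of this kind filters noise but cannot rule out spurious "units" such as pupils.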


3.3 Feature Extraction 


For each positive and negative training example extracted, we create a set of 
features to feed into a classifier. We use a similar set of features as in [4], e.g., 
position in the sentence, position of the sentence in the abstract, etc., plus a bag 
of words representation of the sentence in which the literal is located, and, for 
numerical literals, the deviation from the mean divided by the standard deviation 
of all values of the respective relation, in order to discard outliers. 
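
The outlier feature for numerical literals corresponds to a standard score. The sketch below assumes the population standard deviation and uses made-up values; the helper name `z_score` is hypothetical.

```python
from statistics import mean, pstdev


def z_score(value, known_values):
    """Deviation of value from the mean of known_values, in standard deviations."""
    sigma = pstdev(known_values)
    if sigma == 0:
        return 0.0  # all known values are identical
    return (value - mean(known_values)) / sigma


# Made-up values of one relation; their mean is 5 and their standard deviation 2.
print(z_score(9, [2, 4, 4, 4, 5, 5, 7, 9]))
```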


3.4 Model Building 


To learn models given the examples and feature vectors, we experimented
with different classifiers from the scikit-learn library⁸, i.e., SGD, Naive Bayes,
SVM, Decision Trees, Random Forest, Extra Trees, Bagging Decision Trees, 
and XGBoost. Out of those, the latter five delivered the best results in an ini- 
tial experiment (using split validation on a sample of relations with the most 
already existing instances), without much variance in quality. Random Forests 
were chosen because of a good trade-off between runtime and accuracy. 


7 There are rare exceptions, like currencies, which we ignore.
8 https://scikit-learn.org/.


4 Evaluation


For our evaluation, we used the most recent downloadable version of DBpedia,
i.e., DBpedia 2016-10⁹, and the abstracts contained therein¹⁰. We tried to train
models for all 405 number- and date-valued properties in the DBpedia ontology.
To learn meaningful models, we discarded all properties that were too small (i.e.,
fewer than 100 positive examples), leaving us with 120 properties.

Following the approach in [4], we aimed at achieving a high precision in the 
extraction in order not to add too much noise to the knowledge graph at hand. 
Therefore, we validated all models internally using a training (75%) and a test 
(25%) split, and kept only those models achieving a precision of at least 95%. 
Out of those 120 properties, we could learn a model at 95% precision in 28 cases. 
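
The model selection step can be sketched in plain Python as follows. The helper names and the toy labels are hypothetical; only the precision threshold of 95% on a held-out test split is taken from the text.

```python
def precision(y_true, y_pred):
    """Share of predicted positives that are correct; None if nothing is predicted."""
    predicted = [t for t, p in zip(y_true, y_pred) if p]
    if not predicted:
        return None
    return sum(1 for t in predicted if t) / len(predicted)


def select_models(test_results, min_precision=0.95):
    """Keep the relations whose test-split precision reaches the threshold.

    test_results maps a relation name to (y_true, y_pred) on its test split.
    """
    kept = {}
    for relation, (y_true, y_pred) in test_results.items():
        prec = precision(y_true, y_pred)
        if prec is not None and prec >= min_precision:
            kept[relation] = prec
    return kept


# Toy labels: 39 of 40 predictions correct vs. 2 of 4 correct.
results = {
    "population": ([1] * 39 + [0], [1] * 40),
    "density": ([1, 0, 0, 1], [1, 1, 1, 1]),
}
print(select_models(results))
```

Filtering per relation in this way trades coverage for precision: relations without a sufficiently precise model contribute no statements at all.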

As shown in Table 2, for those 28 relations, the approach creates almost 9M
statements; however, only a small fraction (about 7%) is not yet contained in
DBpedia. That share of new statements is considerably higher for dates (11%)
than for numbers (less than 1%). The majority of the former are birthdates, the
majority of the latter are population numbers.


Table 2. Number of statements extracted at 95% precision according to internal 
validation. 


Range | Properties | Statements | New statements
Date | 17 | 5,525,089 | 621,747
Int | 6 | 224,606 | 15,326
Float | 5 | 3,185,497 | 5,955
Total | 28 | 8,955,192 | 643,030


In order to validate the precision values of the internal validation based on the 
test set, we randomly sampled 500 of the new statements for manual inspection. 
This inspection yields a precision of 94.2%, which confirms the estimation based 
on the internal test set. 

In terms of runtime, a complete run on the entire DBpedia and the
corresponding Wikipedia abstracts takes about 135 h on a Linux server with 512 GB
of RAM. By far the most time is consumed by the preprocessing of the abstracts,
e.g., the date tagging and parsing alone takes 65 h, whereas the model training
and statement creation take 1.9 h and 3.6 h, respectively.


9 https://wiki.dbpedia.org/downloads-2016-10.
10 http://downloads.dbpedia.org/2016-10/core-i18n/en/long_abstracts_en.tql.bz2.


5 Conclusion and Outlook


With this paper, we have aimed at closing a gap in the current research landscape 
on knowledge graph completion. While research in this field is strongly focused 
on type and relation prediction, we have shown how numeric and date-valued
facts can be extracted from Wikipedia abstracts. While there are quite a few 
challenges, including number and date formats and unit conversions, we have 
shown that it is possible to achieve an extraction at a precision of about 95%. 
The code used to create the results reported in this paper is available online¹¹.

In the future, we plan to apply the approach to other Wiki-based knowledge 
graphs, such as DBkWik [6].

Abstract. Over the last two decades, the amount of data created,
published, and managed using Semantic Web standards, especially the
Resource Description Framework (RDF), has been increasing. As a result,
efficient processing of such big RDF datasets has become challenging.
Indeed, these processes require both efficient storage strategies and
query-processing engines in order to scale in terms of data size. In this
study, we propose a scalable approach to evaluate SPARQL queries over
distributed RDF datasets using semantic-based partitioning; it is
implemented inside the state-of-the-art RDF processing framework SANSA.
An evaluation of the performance of our approach in processing
large-scale RDF datasets is also presented. The preliminary results of
the conducted experiments show that our approach can scale horizontally
and performs well compared with the previous Hadoop-based system. It is
also comparable with the in-memory SPARQL query evaluators when there is
less shuffling involved.


1 Introduction 


Recently, significant amounts of data have been created, published and managed 
using the Semantic Web standards. Currently, the Linked Open Data (LOD) 
cloud comprises more than 10,000 datasets available online¹ using the Semantic
Web standards. RDF is a standard that represents data linked as a graph of 
resources following the idea of the linking structure of the Web and using URIs 
for representation. 

To facilitate better maintenance and faster access to data at this scale,
efficient data partitioning is needed. One such partitioning strategy is
semantic-based partitioning. It groups the facts based on the subject and its
associated triples. We want to explore and evaluate the effect of semantic-based
partitioning on query performance when dealing with such a volume of RDF data.


1 http://lodstats.aksw.org/.

SPARQL is a W3C standard query language for querying data modeled as
RDF. Querying RDF data efficiently becomes challenging when the size of the
data increases. This has motivated a considerable amount of work on designing
distributed RDF systems able to efficiently evaluate SPARQL queries [6, 20, 21].
Being able to query a large amount of data efficiently and quickly is one
of the key requirements for every SPARQL engine.

To address these challenges, in this paper, we propose a scalable
semantic-based distributed approach² for efficient evaluation of SPARQL queries
over distributed RDF datasets. The main components of the system are the data
partitioning and the query evaluation over this data representation.

Our contributions are: 


— A scalable approach for semantic-based partitioning using the distributed 
computing framework, Apache Spark. 

— A scalable semantic-based query engine (SANSA.Semantic) on top of Apache 
Spark (under the Apache Licence 2.0). 

— A comparison with state-of-the-art engines that demonstrates the performance
empirically.

— Integration with the SANSA [13]³ framework.


The rest of the paper is structured as follows: Our approach for data
modeling, data partitioning, and query translation using a distributed framework
is detailed in Sect. 3 and evaluated in Sect. 4. Related work on SPARQL
query engines is discussed in Sect. 5. Finally, we conclude and suggest planned
extensions of our approach in Sect. 6.


2 Preliminaries 


Here, we first introduce the basic notions used throughout the paper. 


Apache Hadoop and MapReduce. Apache Hadoop is a distributed frame- 
work that allows for the distributed processing of large data sets across a cluster 
of computers using the MapReduce paradigm. Besides its computing system, it
contains a distributed file system: the Hadoop Distributed File System (HDFS), 
which is a popular file system capable of handling the distribution of the data 
across multiple nodes in the cluster. 


Apache Spark. Apache Spark is a fast, general-purpose cluster computing
engine which is built over the Hadoop ecosystem. Its core data structure
is the Resilient Distributed Dataset (RDD) [25], a fault-tolerant and
immutable collection of records that can be operated on in parallel.
Apache Spark provides a rich set of APIs for faster, in-memory processing of
RDDs.


2 https://github.com/SANSA-Stack/SANSA-Query/tree/develop/sansa-query-spark/src/main/scala/net/sansa_stack/query/spark/semantic.
3 http://sansa-stack.net/.

Data Partitioning. Partitioning the RDF data is the process of dividing
datasets into a specific logical and/or physical representation in order to enable
faster access and better maintenance. Often, this process is performed to improve
system availability, load balancing, and query processing time. There are
many different data partitioning techniques proposed in the literature. We choose
to investigate the behavior of the so-called semantic-based partitioning when
dealing with large-scale RDF datasets. This partitioning technique was proposed
in the SHARD [17] system. We have implemented this technique using an in-memory
processing engine, Apache Spark, for better performance. A semantically
partitioned fact is a tuple $(S, R)$ containing pieces of information
$R \in (P, O)$ about the same $S$, where $S$ is a unique subject in the RDF graph
and $R$ represents all its associated facts, i.e., predicates $P$ and objects $O$.
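
To make the tuple (S, R) concrete, the following single-machine sketch groups triples by subject. It is written in plain Python for illustration only, whereas the approach itself runs on Apache Spark; the function name `partition_by_subject` and the sample triples are assumptions.

```python
from collections import defaultdict


def partition_by_subject(triples):
    """Group (subject, predicate, object) triples into one line per subject."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append(f"{p} {o}")
    # One entry per subject, in the form "S P1 O1 P2 O2 ...".
    return [f"{s} {' '.join(facts)}" for s, facts in grouped.items()]


triples = [
    ("Joy", ":owns", "Car1"),
    ("Joy", ":livesIn", "Bonn"),
    ("Bonn", ":cityOf", "Germany"),
]
for line in partition_by_subject(triples):
    print(line)
```

In a distributed setting the grouping would be a key-based aggregation over the subject rather than an in-memory dictionary.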


3 Approach 


In this section, we present the system architecture of our proposed approach, the 
semantic-based partitioning, and mapping SPARQL to Spark Scala-compliant 
code. 


3.1 System Architecture Overview 


The system architecture overview is shown in Fig. 1.


[Figure 1: system architecture. RDF data passes through data ingestion into the
RDF layer, is partitioned by subject (e.g., "Joy :owns Car1 :livesIn Bonn"), and
is then queried with SPARQL.]


Fig. 1. System architecture overview. 


It consists of three main facets: Data Storage Model, SPARQL Query
Fragments Translator, and Query Evaluator. Below, each facet is discussed in more
detail.

Data Storage Model. We model the RDF data following the concept of RDDs.
RDDs are immutable collections of records, which represent the basic building
blocks of the Spark framework. RDDs can be kept in memory and can be
operated on in parallel throughout the Spark cluster. We make use of SANSA [13]’s
data representation and distribution layer for such representation.

First, the RDF data (see Step 1 as an example) needs to be loaded into a
large-scale distributed storage (Step 2). We use the Hadoop Distributed File System
(HDFS)⁴. We choose HDFS because Spark is capable of performing operations based
on data locality in order to choose the nearest data for fast and efficient
computation over the cluster. Second, we partition (Step 3) the data using
semantic-based partitioning (see Step 4 as an example of such a partition). Instead
of working with a table-wise representation where the triples are kept in the
format RDD<Triple>, data is partitioned into subject-based groupings (e.g., all
entities which are associated with a unique subject). Consider the example in
Fig. 1 (Step 2, first line), which represents two triples associated with the entity
Joy:

Joy :owns Car1 :livesIn Bonn


This line represents that the entity Joy owns a car entity Car1, and that Joy 
lives in Bonn. 

Flattening data is often considered immature with respect to other data
representations; we want to explore and investigate whether it improves the
performance of the query evaluation. We choose this representation for ease of
storage and reuse while designing a query engine. It slightly degrades
the performance when multiple scans over the table are needed because
multiple predicates are involved in the query. However, this effect is minimal,
as Spark uses in-memory caching operations. We will discuss this in more detail
in Sect. 4.


SPARQL Query Fragments Translation. This process generates the Scala
code in the format of Spark RDD operations using the key-value pairs
mechanism. With Spark pairRDD, one can manipulate the data by splitting it into
key-value pairs and grouping all values associated with the same key. The
translator walks through the SPARQL query (Step 4) using Jena ARQ⁵, iterates
through the clauses in the SPARQL query, and binds the variables to the RDF data
while fulfilling the clause conditions. Each such iteration maps a single clause to
one of the Spark operations (e.g., map, filter, reduce). Often this operation needs
to be materialized, i.e., the result set of the next iteration depends on the
previous clauses, and therefore a join operation is needed. This is a bottleneck
since scanning and shuffling are required. In order to keep these joins as small as
possible, we leverage the caching techniques of the Spark framework by keeping the
intermediate results in memory while the next iteration is performed. Finally, the


4 https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
5 https://jena.apache.org/documentation/query/.

Algorithm 1. Spark parallel semantic-based query engine.
input : q: a SPARQL query, input: an RDF dataset
output: result: an RDD containing the result set

  /* Loading the graph */
1 graph = spark.rdf(lang)(input)
  /* Partitioning the graph. See Algorithm 2 for more details. */
2 partitionGraph ← graph.partitionAsSemanticGraph()
  /* Querying the graph. See Algorithm 3 for more details. */
3 result ← partitionGraph.sparql(q)
4 return result


Spark-Scala executable code is generated (Step 5) using the bindings
corresponding to the query. Besides simple BGP translation, our system supports
UNION, LIMIT and FILTER clauses.
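
The clause-by-clause evaluation with joins on shared variables can be illustrated with a small single-machine sketch. It is plain Python rather than generated Spark-Scala code, it covers basic graph patterns only, and all function names and sample data are assumptions made for this example.

```python
def match(pattern, triple):
    """Bind the variables of one triple pattern against one triple, or return None."""
    binding = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.get(term, value) != value:
                return None  # same variable would get two different values
            binding[term] = value
        elif term != value:
            return None  # constant term does not match
    return binding


def join(left, right):
    """Merge pairs of bindings that agree on all shared variables."""
    return [
        {**a, **b}
        for a in left
        for b in right
        if all(a[k] == b[k] for k in a.keys() & b.keys())
    ]


def evaluate_bgp(patterns, triples, select):
    """Evaluate a basic graph pattern clause by clause and project the result."""
    result = [{}]  # a single empty binding is the neutral element of the join
    for pattern in patterns:
        clause = [b for b in (match(pattern, t) for t in triples) if b is not None]
        result = join(result, clause)
    return [{v: b[v] for v in select} for b in result]


triples = [
    ("Joy", ":owns", "Car1"),
    ("Joy", ":livesIn", "Bonn"),
    ("Car1", ":madeBy", "Audi"),
    ("Bonn", ":cityOf", "Germany"),
]
query = [("?person", ":owns", "?car"), ("?car", ":madeBy", "?maker")]
print(evaluate_bgp(query, triples, ["?person", "?maker"]))
```

The nested-loop join here stands in for the distributed join, which is where the scanning and shuffling costs discussed above arise.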


Query Evaluator. The mappings created as shown in the previous section can
now be evaluated directly as Spark RDD executable code. The result set
of these operations is a distributed data structure of Spark (e.g., an RDD) (Step 6).
The result set can be used for further processing and visualization using the
SANSA-Notebooks (Step 7) [5].


3.2 Distributed Algorithm Description 


We implement our approach using the Apache Spark framework (see 
Algorithm 1). It constructs the graph (line 1) while reading RDF data and con- 
verts it into an RDD of triples. Later, it partitions the data (line 2, for more 
details see Algorithm 2) using the semantic-based partitioning strategy. Finally, 
the query evaluator is constructed (line 3) which is detailed in Algorithm 3. 

The partition algorithm (see Algorithm 2) transforms the RDF graph into 
a convenient SP (line 2). For each unique triple in the graph in a distributed 
fashion, it does the following: It gets the values about subjects and objects 
(line 3) and local name of the predicate (line 4). It generates the key-value pairs 
of the subject and its associated triples with predicate and object separated 
with the space in between (line 5). After the mapping is done, the data is 
grouped by key (in our case subject) (line 6). Afterward, when this information 
is collected, the block is partitioned using the map transformation function of 
Spark to refactor the format of the lines based on the above information (line 
7). 

The SPARQL query rewriter (see Algorithm 3) includes multiple Spark operations.
First, partitioned data is mapped to a list of variable bindings satisfying the
first basic graph pattern (BGP) of the query (line 2). During this process, the
duplicates are removed and the intermediate result is kept in memory (RDD) with
the variable bindings as a key. The subsequent step is to iterate through other
variables and bind them by processing the upcoming query clauses and/or filtering the

Algorithm 2. partitionAsSemanticGraph: Semantic-based partition
algorithm.
input : graph: an RDD of triples
output: partitionedData: an RDD of partitions
1 partitionedData ← ∅
2 foreach triple ∈ graph && triple.getSubject ≠ ∅ do
3     s ← triple.getSubject; o ← triple.getObject
4     p ← triple.getPredicate.getLocalName
5     partitionedData += (s, p + " " + o + " ")
6 partitionedData.reduceByKey(_ + _)
7     .map(f ⇒ (f._1 + " " + f._2))
8 return partitionedData


Algorithm 3. sparql: Semantic-based query algorithm.
input : partitionedData: an RDD of partitions
output: result: an RDD of result set
1 foreach p ∈ partitionedData do
2     1stVariable ← assignVariablesFor1stClause()
3     foreach i ∈ getClauses() do
4         iVariable ← assignVariablesForiClause()
5         mapResult ← mapByKey(getCommonVariables())
6         joinResult ← join(mapResult)
7     joinResult.filter(getSelectVariables())
8     result ← result.join(joinResult)
9 return result


other ones unseen in the new clause. These intermediate steps perform Spark
operations over both the partitioned data and the previously bound variables,
which were kept in Spark RDDs.

The ith step discovers all variables in the partitioned data which satisfy the
ith clause and keeps this intermediate result in memory, with the key
being any variable in the ith step which has been introduced in a previous
step. During this iteration, the intermediate results are reconstructed in such a
way that the variables not seen in this iteration are mapped (line 5) to the
variables of the previous clause, generating a key-value pair of variable bindings.
Afterward, the join operation is performed over the intermediate results from the
previous clause and the new ones with the same key. This process iterates until
all clauses are seen and all variables are assigned. Finally, the variable binding
(line 7) to fulfill the SELECT clause of the SPARQL query happens, and the
result (line 8) is returned with only those variables which are present in the
SELECT clause.


4 Evaluation


In our evaluation, we observe the impact of semantic-based partitioning and 
analyze the scalability of our approach when the size of the dataset increases. 

In the following subsections, we present the benchmarks used along with the 
server configuration setting, and finally, we discuss our findings. 


4.1 Experimental Setup 


We make use of two well-known SPARQL benchmarks for our experiments: the 
Waterloo SPARQL Diversity Test Suite (WatDiv) v0.6 [3] and Lehigh Univer- 
sity Benchmark (LUBM) v3.1 [8]. The dataset characteristics of the considered 
benchmarks are given in Table 1. 

WatDiv comes with a test suite with different query shapes which allows us 
to compare the performance of our approach and the other approaches. In par- 
ticular, it comes with a predefined set of 20 query templates which are grouped 
into four categories, based on the query shape: star-shaped queries, linear-shaped 
queries, snowflake-shaped queries, and complex-shaped queries. We have used
WatDiv datasets with 10M to 100M triples with scale factors 10 and 100, respec- 
tively. In addition, we have generated the SPARQL queries using WatDiv Query 
Generator. 

LUBM comes with a Data Generator (UBA) which generates synthetic data 
over the Univ-Bench ontology in the unit of a university. LUBM provides Test 
Queries, more specifically 14 test queries. Our LUBM datasets consist of 1000, 
2000, and 3000 universities. The number of triples varies from 138M for 1000 
universities, to 414M triples for 3000 universities. 


Table 1. Dataset characteristics (nt format).


                 LUBM                                      Watdiv
                 1K           2K           3K              10M         100M
No. of triples   138,280,374  276,349,040  414,493,296     10,916,457  108,997,714
Size (GB)        24           49           70              1.5         15


We implemented our approach using Spark-2.4.0, Scala 2.11.11, Java 8, and 
all the data were stored on the HDFS cluster using Hadoop 2.8.0. All experi- 
ments were carried out on a commodity cluster of 6 nodes (1 master, 5 workers): 
Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz (32 Cores), 128 GB RAM, 12 
TB SATA RAID-5. We executed each experiment three times and the average 
query execution time has been reported. 

4.2 Preliminary Results


We run experiments on the same cluster and evaluate our approach using the 
above benchmarks. In addition, we compare our proposed approach with selected 
state-of-the-art distributed SPARQL query evaluators. In particular, we compare 
our approach with SHARD [17] — the original approach implemented on Hadoop 
MapReduce, SPARQLGX [6]’s direct evaluator SDE, and Sparklify [21] and 
report the query execution time (cf. Table 2). We have selected these approaches 
as they do not include any pre-processing steps (e.g. statistics) while evaluating 
the SPARQL query, similar to our approach. 


Table 2. Performance analysis on large-scale RDF datasets. Runtime (s) (mean).


Dataset      Query  SHARD   SPARQLGX-SDE  SANSA.Sparklify  SANSA.Semantic
Watdiv-10M   C3     n/a     38.79         72.94            90.48
             F3     n/a     38.41         74.69            n/a
             L3     n/a     21.05         73.16            72.84
             S3     n/a     26.27         70.1             79.7
Watdiv-100M  C3     n/a     81.51         96.59            300.82
             F3     n/a     62.86         91.2             n/a
             L3     n/a     84.09         y                189.89
             S3     n/a     23.6          93.02            176.2
LUBM-1K      Q1     774.93  03.74         103.57           226.21
             Q2     fail    fail          3348.51          329.69
             Q3     772.55  26.31         107.25           235.31
             Q4     988.28  82.52         111.89           294.8
             Q5     771.69  01.05         100.37           226.21
             Q6     fail    73.05         100.72           207.06
             Q7     fail    60.94         113.03           277.08
             Q8     fail    79.56         114.83           309.39
             Q9     fail    204.62        114.25           326.29
             Q10    780.05  06.26         110.18           232.72
             Q11    783.2   12.23         105.13           231.36
             Q12    fail    59.65         105.86           283.53
             Q13    778.16  00.06         90.87            220.28
             Q14    688.44  74.64         100.58           204.43

Our evaluation results for performance analysis, sizeup analysis, node scala- 
bility, and breakdown analysis by SPARQL queries are shown in Table 2, Figs. 2, 
3 and 4 respectively. In Table 2 we use “fail” whenever the system fails to com- 
plete the task and “n/a” when the task could not be completed due to a parser 
error (e.g., not able to translate some of the basic patterns to RDD operations).

In order to evaluate our approach with respect to speedup, we analyze
and compare it with other approaches. This set of experiments was run on three
datasets: Watdiv-10M, Watdiv-100M, and LUBM-1K.

Table 2 presents the performance analysis of the systems on the three
datasets. Our approach evaluates most of the queries, as opposed to SHARD:
SHARD fails to evaluate most of the LUBM queries, and its parser does not
support the Watdiv queries. On the other hand, SPARQLGX-SDE performs better
than both Sparklify and our approach when the dataset is comparatively small
(e.g., less than 25 GB). This behavior is due to the large partitioning
overhead of Sparklify and our approach. However, Sparklify performs better
than SPARQLGX-SDE when the size of the dataset increases (see the Watdiv-100M
results in Table 2) and the queries involve more joins (see the LUBM-1K
results in Table 2). This is due to the Spark SQL optimizer and Sparqlify’s
self-join optimizations. Both SHARD and SPARQLGX-SDE fail to evaluate query
Q2 on the LUBM-1K dataset. Sparklify can evaluate the query but takes longer
than our approach. This is because our approach uses Spark’s lazy evaluation
and optimizes joins by keeping the intermediate results in memory.


[Figure 2. Legend: SHARD, SPARQLGX-SDE, SANSA.Sparklify, SANSA.Semantic. X-axis: LUBM dataset / (size in GB), with values 1K/24, 2K/49, 3K/70.]


Fig. 2. Sizeup analysis (on LUBM dataset).


Scalability Analysis. In order to evaluate the scalability of our approach, we
conducted two sets of experiments. First, we measure the data scalability (i.e.,
size-up) of our approach and compare it with the other approaches. As SHARD
fails for most of the LUBM queries, we restrict this set of experiments to Q1,
Q5, and Q14. Q1 was chosen for its complexity, as it has a large input and high
selectivity; Q5 because it produces considerably larger intermediate results
due to the triangular pattern in the query; and Q14 mainly for its simplicity.
We run the experiments on three different sizes of LUBM (see Fig. 2). We keep
the number of nodes constant at 5 worker nodes and increase the size of the
dataset to measure how our approach deals with larger datasets.

We see that the query execution time of our approach grows linearly as
the size of the dataset increases. This shows the scalability of our approach
compared to SHARD in the context of sizeup. SHARD suffers from the expensive
overhead of MapReduce joins, which degrades its performance; as a result, it
is significantly slower than the other systems.

Second, in order to measure the node scalability of our approach, we increase
the number of worker nodes and keep the size of the dataset constant. We use
1, 3, and 5 worker nodes.


[Figure 3. Legend: SHARD, SPARQLGX-SDE, SANSA.Sparklify, SANSA.Semantic.]


Fig. 3. Node scalability (on LUBM-1K).


Figure 3 shows the performance of the systems on the LUBM-1K dataset when the
number of worker nodes varies. As the number of nodes increases, the runtime
of our query engine decreases linearly, whereas the runtime of SHARD stays
constant (and high) even when more worker nodes are added. This trend is due
to the communication overhead that SHARD incurs between the map and reduce
steps. The execution time of our approach decreases by a factor of about 2.8
(from 1,821.75 s down to 656.85 s) as the number of worker nodes increases
from one to five. SPARQLGX-SDE and Sparklify perform better than our approach
and SHARD as the number of nodes increases.

Our main observation here is that our approach can achieve linear scalability 
in the performance. 

Correctness. In order to assess the correctness of the result set, we computed
the size of the result set for each of the given queries and compared it with
the sizes obtained by the other approaches. All approaches return result sets
of exactly the same size, which supports the correctness of the results.


Breakdown by SPARQL Queries. Here we analyze some of the LUBM queries
(Q1, Q5, Q14) run on the LUBM-1K dataset in cluster mode on all the systems.


[Figure 4. Legend: SHARD, SPARQLGX-SDE, SANSA.Sparklify, SANSA.Semantic. X-axis: LUBM queries.]


Fig. 4. Overall analysis of queries on LUBM-1K dataset (cluster mode).


We can see from Fig. 4 that our approach performs better than the
Hadoop-based system, SHARD. This is due to the use of the Spark framework,
which leverages in-memory computation for faster performance. However, its
performance is lower than that of the other approaches, which use vertical
partitioning (e.g., SPARQLGX-SDE on RDDs and Sparklify on Spark SQL).
This is because our approach performs de-duplication of triples, which
involves shuffling and incurs network overhead. The results also show that,
compared to Sparklify, the performance of SPARQLGX-SDE decreases as the number
of triple patterns in the query increases (see Q5). However, SPARQLGX-SDE
performs better on simple queries (see Q14). This occurs because SPARQLGX-SDE
must read the whole RDF graph for each triple pattern involved. In contrast,
Sparklify performs better when more triple patterns are involved (see Q5)
but slightly worse when linear queries (see Q14) are evaluated.

Based on our findings and the evaluation study carried out in this paper, we 
show that our approach can scale up with the increasing size of the dataset. 

5 Related Work


Partitioning of RDF Data. Centralized RDF stores use relational (e.g., 
Sesame [4]), property (e.g., Jena [23]), or binary tables (e.g., SW-Store [1]) for 
storing RDF triples or maintain the graph structure of the RDF data (e.g., 
gStore [26]). For dealing with big RDF datasets, vertical partitioning and exhaus- 
tive indexing are commonly employed techniques. For instance, Abadi et al. [2] 
introduce a vertical partitioning approach in which each predicate is mapped to 
a two-column table containing the subject and object. This approach has been 
extended in Hexastore [22] to include all six permutations of subject, predicate, 
and object (s, p, o). To improve the efficiency of SPARQL queries RDF-3X [14] 
has adopted exhaustive indices not only for all (s, p, o) permutations but also 
for their binary and unary projections. While some of these techniques can be
used in distributed configurations as well, storing and querying RDF datasets in
distributed environments poses new challenges such as scalability. In our
approach, we tackle partitioning and querying of big RDF datasets in a
distributed manner.
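
To make the vertical partitioning scheme mentioned above concrete, the following minimal sketch builds one two-column (subject, object) table per predicate. It only illustrates the idea on in-memory data; the function name `vertical_partition` and the sample triples are hypothetical and not taken from any of the cited systems.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Map each predicate to a two-column table of (subject, object) rows."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


# Hypothetical sample data.
EXAMPLE_TRIPLES = [
    ("book1", "author", "alice"),
    ("book1", "year", "2001"),
    ("book2", "author", "bob"),
]

TABLES = vertical_partition(EXAMPLE_TRIPLES)
```

A triple pattern with a bound predicate then only needs to scan the table of that predicate, for example `TABLES["author"]`, instead of the full set of triples.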

Partitioning-based approaches for distributed RDF systems propose to partition
an RDF graph into fragments which are hosted in centralized RDF stores
at different sites. Such approaches use either standard partitioning algorithms 
like METIS [9] or introduce their own partitioning strategies. For instance, Lee 
et al. [12] define a partition unit as a vertex with its closest neighbors based 
on heuristic rules while DiploCloud [24] and AdPart [10] use physiological RDF 
partitioning based on RDF molecules. In our proposal, we use a semantic-based 
partitioning approach. 


Hadoop-Based Systems. Cloud-based approaches for managing large-scale
RDF mainly use NoSQL distributed data stores or employ various partitioning
approaches on top of the Hadoop infrastructure, i.e., the Hadoop Distributed
File System (HDFS) and its MapReduce implementation, in order to leverage the
computational resources of multiple nodes. For instance, Sempala [19] is a
Hadoop-based SPARQL-to-SQL approach. It uses Impala⁶ as a distributed SQL
processing engine. Sempala uses unified vertical partitioning based on a single
property table to improve the runtime of star-shaped queries by avoiding joins.
The limitation of Sempala is that it was designed only for that particular
query shape. PigSPARQL [18] uses a Hadoop-based implementation of vertical
partitioning for data representation. It translates SPARQL queries into Pig⁷
Latin queries and runs them using the Pig engine. A more recent approach based
on MapReduce is RYA [16]. It is a Hadoop-based scalable RDF store which uses
Accumulo⁸ as a distributed key-value store for indexing the RDF triples. One of
RYA’s advantages is its ability to reorder joins. The main drawback of RYA is
that it relies on disk-based processing, which increases query execution times.
Other RDF systems like JenaHBase [11] and H2RDF+ [15] use the Hadoop database
HBase for storing triple and property tables.


6 https://impala.apache.org/.
7 https://pig.apache.org/.
8 http://www.accumulo.apache.org.

SHARD [17] is one approach which groups RDF data into dedicated partitions,
so-called semantic-based partitions. It groups the RDF data by subject and
implements a query engine which iterates through each of the clauses of the
query and processes them in turn. A MapReduce job is created for each triple
pattern that is scanned, which leads to a large query plan containing many
Map and Reduce jobs. Our partitioning algorithm is based on SHARD, but
instead of creating MapReduce jobs we employ the Spark framework in order to
increase scalability.


In-Memory Systems. S2RDF [20] is a distributed query engine which translates
SPARQL queries into SQL and runs them on Spark SQL. It introduces a data
partitioning strategy that extends vertical partitioning with additional
statistics, containing pre-computed semi-joins for query optimization.
SPARQLGX [6] is similar to S2RDF, but instead of translating SPARQL to SQL,
it maps SPARQL directly to Spark RDD operations. It is a scalable query engine
which is capable of efficiently evaluating SPARQL queries over distributed
RDF datasets [7]. It uses a simplified vertical partitioning (VP) approach,
where each predicate is assigned to a specific Parquet file. In addition, it is
able to use RDF statistics for further query optimization, while also providing
the possibility of directly querying files on HDFS using SDE. Recently,
Sparklify [21], a scalable software component for efficient evaluation of
SPARQL queries over distributed RDF datasets, has been proposed. The approach
uses Sparqlify⁹ as a SPARQL-to-SQL rewriter for translating SPARQL queries into
Spark executable code. In our approach, intermediate results are kept in memory
in order to accelerate query execution over big RDF data.


6 Conclusions and Future Work 


In this paper, we propose a scalable semantic-based query engine for efficient
evaluation of SPARQL queries over distributed RDF datasets. It uses a
semantic-based partitioning strategy for data distribution and converts SPARQL
to Spark executable code. By doing so, it leverages the advantages of the Spark
framework’s rich APIs. We have shown empirically that our approach can scale
horizontally and performs well compared with the previous Hadoop-based
system, the SHARD triple store. It is also comparable with other in-memory
SPARQL query evaluators when less shuffling is involved, i.e., when there are
fewer duplicate values.

Our next steps include expanding our parser to support more SPARQL fragments
and adding statistics to the query engine for use during query evaluation. We
want to analyze the query performance on large-scale RDF datasets and
explore prospects for improvement. For example, we intend to investigate
the re-ordering of the BGPs and evaluate its effect on query execution time.


9 https://github.com/SmartDataAnalytics/Sparqlify.

Abstract. With the continuous growth of the Linked Data Cloud,
adequate methods to efficiently explore semantic data are increasingly 
required. Faceted browsing is an established technique for exploratory 
search. Users are given an overview of a collection’s attributes that can 
be used to progressively refine their filter criteria and delve into the data. 

However, manual facet predefinition is often inappropriate for at least 
three reasons: Firstly, heterogeneous and large scale knowledge graphs 
offer a huge number of possible facets. Choosing among them may be 
virtually impossible without algorithmic support. Secondly, knowledge 
graphs are often constantly changing, hence, predefinitions need to be 
redone or adapted. Finally, facets are generally applied to only a subset 
of resources (e.g., search query results). Thus, they have to match this 
subset and not the knowledge graph as a whole. Precomputing facets for 
each possible subset is impractical except for very small graphs. 

We present our approach for automatic facet generation and selection 
over knowledge graphs. We propose methods for (1) candidate facet gen- 
eration and (2) facet ranking, based on metrics that both judge a facet 
in isolation as well as in relation to others. We integrate those methods 
in an overall system workflow that also explores indirect facets, before 
we present the results of an initial evaluation. 


Keywords: Faceted browsing · Facet ranking · Knowledge graph ·
Exploratory search


1 Introduction 


A facet is by definition¹ a particular aspect or feature of something. In the
present work, this is applied to a set of resources that could be viewed under 
different aspects. Each aspect is called a facet and consists of several categories, 
facet values, which can be used to filter the initial resource set. The number of 
resources that are associated with a certain facet value is called value size. 


1 Oxford Dictionaries: https://en.oxforddictionaries.com/definition/facet.

As an example, a list of books can be viewed under the aspect of
their genre. Choosing the facet value science fiction selects the books of this
specific genre. The number of selected resources then corresponds to the
value size of the facet value science fiction. The same list could be viewed under 
the aspect of their publication year, with each sublist containing only books 
published in one particular year. These two aspects, genre and publication year, 
are just two of the many possible facets for books. 
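
The notions of facet, facet value, and value size from this example can be sketched in a few lines. The data and the helper names `value_sizes` and `filter_by` are hypothetical and serve only to illustrate the terminology; they are not part of the described system.

```python
from collections import Counter

# Hypothetical resource set: each book has properties that can act as facets.
BOOKS = {
    "book1": {"genre": "science fiction", "publication year": 1965},
    "book2": {"genre": "science fiction", "publication year": 1984},
    "book3": {"genre": "romance", "publication year": 1984},
}


def value_sizes(resources, facet):
    """Count, for each facet value, how many resources are associated with it."""
    return Counter(props[facet] for props in resources.values() if facet in props)


def filter_by(resources, facet, value):
    """Select the resources that carry the given facet value."""
    return {name: props for name, props in resources.items()
            if props.get(facet) == value}


GENRE_SIZES = value_sizes(BOOKS, "genre")
SCI_FI = filter_by(BOOKS, "genre", "science fiction")
```

Here the value size of science fiction under the facet genre equals the number of books returned by `filter_by`, and the same resource set can be viewed under the facet publication year instead.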

To obtain different facets, we assume each resource to have properties
assigned that link it either to other resources (e.g., genre, which may itself
have a description) or to plain literal values (e.g., publication year). While
our method works on any resource set possessing such properties, we use
semantic models as a rigorous formulation. In particular, we consider knowledge
graphs (KGs). They
provide significant advantages for the creation of facets: First of all, assuming
the resources are drawn from a rich KG, we automatically get a large amount of 
direct resource information from their properties. The values of those properties 
may be resources themselves and can be used to generate indirect facets over the 
initial resource set. For example, an indirect facet for books can be an author’s 
place of birth, where place of birth is linked to author, not to the book itself. 

However, considering continuously changing and heterogeneous resources, 
manually predefining facets is often impractical. Using concepts from large KGs, 
e.g., the Linked Data Cloud, for semantic annotation induces a large number of 
possible facets. Hence, an automated method has to rank the large number of 
candidate facets to be able to pick the most suitable ones among them. 

Nevertheless, determining the single best facet is not enough. Users generally
expect a list of facets to choose from. Moreover, this list should not be
extremely long, and its items should be “useful” both individually and as a
collection. Were it not for the requirement of usefulness as a collection,
simply choosing the top-k highest-ranked facets would be sufficient. However,
avoiding facets that are semantically very close to each other is important as
well. Once such facets are identified, criteria need to be defined to decide
which of the candidates to drop to arrive at the final list of facets.

We propose an approach for dynamic facet generation and facet ranking over 
KGs. Our ranking is based on intra- and inter-facet metrics to determine the 
usefulness of a facet, also in the presence of others. A key aspect is exploiting 
indirect properties to find better categorizations. Since inter-facet metrics have 
not been satisfactorily addressed so far, we present semantic similarity as a 
usefulness criterion. 

Based on our previously proposed workflow [1], we integrated all methods 
into an initial prototypical implementation [2]. While this leverages data from 
a specific KG, i.e., Wikidata [3], the methods we describe and use are generally 
applicable without or with only minimal changes to a wide range of KGs. Possi- 
ble applications include exploratory browsing of a data catalog of semantically 
annotated datasets, or the reduction of a search result set using facets as filters. 

In Sect. 2 we first revisit related work in this direction. We then
discuss the methods we used for candidate facet generation and ranking in
Sect. 3 and propose our workflow in Sect. 4. We present evaluation results in
Sect. 5. Finally, we conclude and discuss future work in Sect. 6.


2 Related Work 


Faceted browsing over KGs has been the subject of various research efforts, 
e.g., [4]. Prominent approaches such as Ontogator [5] or mSpace [6] use statically 
predefined facets for data navigation and do not consider continuously changing 
data sources. Moreover, their evaluation scenarios suppose data homogeneity 
and domain-dependent collections like cultural artifacts [5] or classical music [6]. 

Other projects include BrowseRDF, Parallax, gFacet, Faceted Wikipedia,
VisiNav, Rhizomer, SPARKLIS, SemFacet, Grafa, MediaFaces, and Hippalus
([7-17], resp.). Facets are either dynamically selected from a precomputed
set of facets or dynamically generated on the fly. The latter type of facets relies 
on building dynamic SPARQL queries and executing them on the respective 
SPARQL endpoints. Grafa [15] proposed a selection strategy to precompute 
only a subset of possible facets to avoid indexing of all data. 

Some of these projects assume a homogeneous data source [7,17], using
very specific datasets from domains such as species [17]. Other contributions
account for domain heterogeneity [8-16] and base their work on large-scale
KGs such as Wikidata [3], DBpedia [18], or Freebase [19]. However, some
projects [9,10,12,13] require an initial interaction (resource type
specification) before any facets are generated.

Various aspects of facet generation are discussed. This includes facet
ranking [7,10-12,15-17], entity type pivoting² [8,9,11-14], visualization
[8,9,11-13], indirect facet generation [6,7,9,13,14], and performance
issues [10,13,15].

Facet ranking is of particular importance for dynamic facet generation in 
order to select from the considerable number of facet candidates. Frequency- 
based ranking was adopted by [10-12,15]. In Faceted Wikipedia [10], facet values 
are ranked based on the value sizes. For facet ranking, the most frequent facets 
corresponding to the selected type are candidates. They are ranked based on 
their most frequent facet value. Note that a ranking is applied only in case of 
resource type selection, otherwise generic facets are displayed. VisiNav [11] also 
adopts a frequency-based approach to rank facets and facet values inspired by 
PageRank [20]. The respective scores are calculated based on the PageRank 
score of the data sources [21]. Rhizomer [12] defines relevant facets based on 
the properties usage frequency in the resource type instances and the number 
of different facet values. In Grafa [15], facets are ranked according to the num- 
ber of search result resources that have a value for the specific facet and facet 
values are ordered by PageRank. BrowseRDF [7] proposes three metrics to
measure the quality of facets: (1) predicate balance, considering faceted
browsing as the operation of traversing a decision tree where the tree should
be well balanced, (2) object cardinality, the number of facet values as also
considered in [12], and (3) predicate frequency, similar to [10,12,15]. The
metrics are combined into a final score that is used to rank facets. In
MediaFaces [16], facets are ranked based on the analysis of image search query
logs and user tags of public Flickr³ images. Hippalus [17] introduces a
different ranking approach involving user interaction, where users rank facets
and facet values according to their manually defined preferences.


2 Switching the focus type, e.g., from a set of books to the set of their authors.

We notice that all the previously described efforts concerning facet ranking 
only involve intra-facet metrics that rate facets individually without taking into 
consideration the significance of facet co-occurrence, or in other words inter-facet 
metrics. To the best of our knowledge, only Facetedpedia [22] includes a metric
for measuring the collective usefulness of a collection of facets. However, it
does not take advantage of KGs or semantically annotated collections, but
generates facets over Wikipedia⁴ pages based on the Wikipedia category system. They
consider the navigational cost, i.e. the number of edges traversed, as an intra- 
facet metric that is based on the number of steps required to reach target articles 
and the number of choices at each step. Furthermore, facets are penalized if they 
have a low coverage, i.e., not all the articles can be reached using the considered 
facet. Besides the navigational cost, the average pairwise similarity is proposed 
as an inter-facet metric. However, the used metric is specifically designed to be 
applied on the Wikipedia category system and is not generic enough to express 
semantic similarity in the sense of arbitrary KGs. 


3 Methods 


Before presenting our proposed workflow, this section provides details on the 
employed methods. This includes initial candidate facet generation, handling of 
literal facet values, and the metrics used to compare facets. The latter discussion 
is split into two parts: Intra-facet metrics evaluate a facet in isolation, whereas 
inter-facet metrics judge facets in relation to others. 


3.1 Candidate Facet Generation 


We aim to generate facets over a set of resources given by their respective Inter- 
nationalized Resource Identifiers (IRIs) within the KG. In such a graph we treat 
the relations of the given resources as their properties and thus any applicable 
property path is equivalent to a candidate facet. To achieve a better categoriza- 
tion of resources, we consider not only the direct properties (i.e., values that 
are connected to the resource by a single link), but also indirect properties (i.e., 
chained links are needed to connect a resource and a value). As an example, 
consider a set of resources referring to people. A direct property can be derived 
from a relation place of birth pointing to instances of a class city. An indirect
property could then also exploit an existing link between city and country⁵ to
arrange the connected cities into possibly fewer categories⁶. Indirect
properties are only possible if the range of the associated relation is not a
literal, as literals cannot be the subject of further statements in the
standard RDF model.


3 https://www.flickr.com/.
4 https://www.wikipedia.org/.
5 Assuming there is no direct link between persons and their country of birth.

A candidate facet is now given by a property path within the KG. In case 
of direct properties this path is of length one, whereas for indirect properties 
any path length greater than one is possible. However, longer paths loosen the
connection between resources and facet values. At some point this renders a
facet useless for the given task or at least makes it unclear to users how that
facet is supposed to support them. Furthermore, longer paths increase the number
of candidates and thus require more computation in later phases. For these
reasons, we limit the path length for candidates by a threshold τ.
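
The generation of candidate facets as bounded property paths can be sketched as follows. This is an illustration under simplifying assumptions, not the described implementation: the graph is a plain dictionary from a resource to its (property, value) pairs, literals simply have no outgoing entries, and the names `candidate_paths`, `GRAPH`, and `max_length` are hypothetical.

```python
def candidate_paths(graph, resources, max_length):
    """Enumerate property paths of length 1..max_length starting at the resources."""
    paths = set()
    # Maps each path found so far to the set of nodes it reaches.
    frontier = {(): set(resources)}
    for _ in range(max_length):
        next_frontier = {}
        for path, nodes in frontier.items():
            for node in nodes:
                # Literals are not keys of the graph, so they are never expanded.
                for prop, value in graph.get(node, []):
                    new_path = path + (prop,)
                    paths.add(new_path)
                    next_frontier.setdefault(new_path, set()).add(value)
        frontier = next_frontier
    return paths


# Hypothetical graph: persons link to cities, cities to a country.
GRAPH = {
    "person1": [("place of birth", "city1")],
    "person2": [("place of birth", "city2")],
    "city1": [("country", "country1")],
    "city2": [("country", "country1")],
    "country1": [("continent", "continent1")],
}

DIRECT = candidate_paths(GRAPH, ["person1", "person2"], 1)
UP_TO_TWO = candidate_paths(GRAPH, ["person1", "person2"], 2)
```

With a limit of one, only the direct facet place of birth is produced; raising the limit to two adds the indirect facet place of birth / country, while the three-step path through continent stays excluded.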

We categorize candidate facets into two types: (1) Categorical facets that 
result from property paths connecting exclusively to other resources and (2) 
quantitative facets whose values are given by literals. While we allow quantitative 
candidates for numeric or date literals, we exclude string literals. The rationale 
is that those oftentimes contain labels or descriptions specific to single resources 
and, hence, are barely shared between different ones. As facets rely on common 
values to categorize the given input set, these properties will only rarely provide 
a suitable candidate facet. If a string value is common to multiple resources,
there is a high chance that it should have been modeled as a distinct resource
instead of a literal. Of course, resources are often not modeled perfectly.
Future work might need to include string literals to cope with this type of data.


3.2 Clustering of Quantitative Facets 


As mentioned before, facets can be created from numeric or date literals. Unlike 
categorical facets, it is highly unlikely that the number of distinct values is suf- 
ficiently small to generate a useful facet. However, these values can be clustered 
by dividing their continuous range into discrete subranges. 

The clustering step is only applied to quantitative facets. It replaces the
associated values with value ranges. The number of these clusters is determined
by the optimum value cardinality as defined by the respective intra-facet
metric (see Subsect. 3.3). The clustering technique itself is a consequence of
the rationale behind another intra-facet metric, the value dispersion: it
assembles approximately the same number of values in each cluster.
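
One simple way to obtain clusters of approximately equal size is equal-frequency binning over the sorted values. The sketch below shows this technique as an assumption about how such a clustering could look; the text does not specify the exact procedure, and the name `equal_frequency_clusters` and the sample years are hypothetical.

```python
def equal_frequency_clusters(values, num_clusters):
    """Split sorted values into ranges holding roughly the same number of values.

    Returns a list of (lower bound, upper bound, number of values) tuples.
    """
    ordered = sorted(values)
    n = len(ordered)
    clusters = []
    for i in range(num_clusters):
        # Integer arithmetic distributes any remainder across the clusters.
        chunk = ordered[i * n // num_clusters:(i + 1) * n // num_clusters]
        if chunk:
            clusters.append((chunk[0], chunk[-1], len(chunk)))
    return clusters


# Hypothetical publication years clustered into three ranges.
YEARS = [1990, 1991, 1995, 2000, 2001, 2010, 2015, 2019, 2020]
YEAR_CLUSTERS = equal_frequency_clusters(YEARS, 3)
```

Each resulting range then acts as one facet value. Note that this simple variant may place identical values into neighboring ranges when they straddle a boundary, which a fuller implementation would need to handle.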


3.3 Intra-Facet Metrics 


To select the most useful facets among the candidates, we define metrics to 
judge their usefulness. The first set of metrics presented here assigns scores to 
individual candidates independently of each other. Each metric is designed to 
reflect one intuition of what constitutes a useful facet. 

The first requirement concerns the applicability of the facet. For each facet
we also include an unknown value. This accumulates the resources that do not
support the respective property path, i.e., at least one of the corresponding
relations is missing for this resource. For heterogeneous resource sets, the
unknown value size will be non-zero for most facets. However, for a facet to be
useful, it should apply to as many resources as possible. So we strive for the
value size of unknown to be small in comparison with the overall size of the
resource set.


6 Cities belonging to the same country will be grouped into one category.

These thoughts lead to the definition of the predicate probability of a facet f,
$score_{predicateProb}$, as given in Eq. 1. It calculates the probability that a
randomly chosen resource supports the property path of a given facet.


$$score_{predicateProb}(f) = \frac{|\text{supporting resources}|}{|\text{resources}|} \tag{1}$$
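As a small worked example of the predicate probability, the following Python sketch computes the fraction of input resources that support a facet. The function name and the representation of resources as strings are assumptions made for illustration.

```python
from typing import Iterable, Set


def predicate_prob(resources: Iterable[str], supporting: Set[str]) -> float:
    """Fraction of the input resources that support a facet's property path."""
    resource_set = set(resources)
    if not resource_set:
        return 0.0
    # Only resources from the input set count as supporting resources.
    return len(resource_set & supporting) / len(resource_set)


# Three of four resources support the path, so the unknown value has size one.
print(predicate_prob(["a", "b", "c", "d"], {"a", "b", "c"}))
```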

Our next requirement deals with the number of facet values. We consider a 
facet with only a single value as not useful, as it cannot be used to narrow down
the given set of resources. But then again, facets with too many values provide 
little help as well. Here, users have to scan through a long list of possible options, 
which may even rival the number of input resources. We believe that there is a 
number of values that is optimal in the sense that it balances between a concise 
categorization and a sufficient number of options to choose from. 

Following these considerations, we define the value cardinality,
$score_{valueCard}$, of a facet f with a number of values $c_f$ as given in Eq. 2. The
minimum cardinality is denoted by minCard and the optimal one by optCard.
Note that we chose an asymmetric function that favors facets with fewer values
rather than more. This follows the intuition that better categorizations tend to
have fewer categories. The parameter $\epsilon \neq 0$ allows adjusting the preference for
value sizes between minCard and optCard.


$$score_{valueCard}(f) = \begin{cases} 0 & \text{if } c_f < minCard \\ e^{-\left(\frac{c_f - optCard}{\epsilon^2}\right)^2} & \text{if } minCard \le c_f \le optCard \\ e^{-(c_f - optCard)^2} & \text{if } c_f > optCard \end{cases} \tag{2}$$

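The shape of the value cardinality score can be sketched in Python as follows. The piecewise structure (zero below minCard, a maximum of one at optCard, a slow decay below and a fast decay above) follows the description in the text; the exact exponents of the two branches are an assumption of this sketch, and the defaults are the parameter values used later in the evaluation.

```python
import math


def score_value_card(c_f: int, min_card: int = 2, opt_card: int = 10, eps: float = 3.0) -> float:
    """Asymmetric cardinality score in [0, 1] that peaks at opt_card."""
    if c_f < min_card:
        return 0.0
    if c_f <= opt_card:
        # Assumed slow decay towards fewer values, controlled by eps.
        return math.exp(-(((c_f - opt_card) / eps ** 2) ** 2))
    # Assumed fast decay once the optimum is exceeded.
    return math.exp(-((c_f - opt_card) ** 2))


for c in (1, 2, 8, 10, 12):
    print(c, round(score_value_card(c), 3))
```

With these assumptions, a facet with two values fewer than the optimum still scores much higher than one with two values more, which reflects the stated preference for fewer categories.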
Our final requirement follows the principle of self-balancing search trees:
Each decision made while traversing the tree should eliminate roughly the same
number of results from consideration. In other words, no leaf node (representing
a specific result) is preferred over others in terms of steps needed to reach it
from the root node. Similarly, we do not want to favor any specific category.
For a facet, this means that all value sizes within a single facet should be
approximately equal⁷. As a measure for the variance in value sizes, we employ the
coefficient of variation $c_v$ (see Eq. 3). We chose this coefficient over the plain
standard deviation, as it allows for a better comparison across multiple facets with
possibly different value sizes. Using this, we define the value dispersion,
$score_{dispersion}$, as given in Eq. 4. Here, N is the number of facet values, $x_i$
denotes the value size of the i-th facet value, and $\bar{x}$ is the average of all
value sizes. We exclude the value size of the special facet value unknown from this
calculation, as this value is already exploited in $score_{predicateProb}$.


$$c_v(f) = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}}{\bar{x}} \tag{3}$$

$$score_{dispersion}(f) = \frac{1}{1 + c_v(f)} \tag{4}$$


7 The subsets induced by the different facet values do not have to be disjoint. A single
resource may be linked to several such values. Consider, e.g., the relation part of
that relates country and continent. Here, the individual Russia is connected to two
continents, Asia and Europe, thus appearing as part of both facet values’ results.


All presented metrics are designed to return only values in the range between
zero and one. In order to combine them into a single metric used in the ranking
process (see Sect. 4), we can use a weighted average as shown in Eq. 5. With the
individual weights summing up to one as well, we ensure that the final score is
also between zero and one.


$$score(f) = w_{predicateProb} \times score_{predicateProb}(f) + w_{dispersion} \times score_{dispersion}(f) + w_{valueCard} \times score_{valueCard}(f) \quad \text{with } \sum_i w_i = 1 \tag{5}$$
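The value dispersion and the weighted combination can be sketched in Python as follows. The sketch assumes the population form of the coefficient of variation and a mapping 1 / (1 + c_v) into the range between zero and one; both are assumptions of this illustration, as are all function names and the example weights.

```python
import math
from typing import Dict, Sequence


def coefficient_of_variation(sizes: Sequence[int]) -> float:
    """Standard deviation of the value sizes divided by their mean."""
    mean = sum(sizes) / len(sizes)
    if mean == 0:
        raise ValueError("value sizes must not all be zero")
    variance = sum((x - mean) ** 2 for x in sizes) / len(sizes)
    return math.sqrt(variance) / mean


def score_dispersion(sizes: Sequence[int]) -> float:
    """Assumed mapping into (0, 1]: equally sized values score 1."""
    return 1.0 / (1.0 + coefficient_of_variation(sizes))


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the intra-facet scores; weights must sum to one."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to one")
    return sum(weights[name] * scores[name] for name in weights)


# Value sizes exclude the special value "unknown".
sizes = [40, 35, 25]
scores = {"predicateProb": 0.9, "dispersion": score_dispersion(sizes), "valueCard": 0.8}
weights = {"predicateProb": 0.4, "dispersion": 0.3, "valueCard": 0.3}  # hypothetical weights
print(combined_score(scores, weights))
```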


3.4 Inter-Facet Metrics 


In contrast to their intra-facet counterparts, inter-facet metrics assess the rela- 
tionship between different candidate facets. We use semantic similarity of facets 
as an inter-facet metric. The motivation is to prevent facets that are too close to 
one another and thus would provide about the same partitioning of the resource 
set. Moreover, semantically distant facets increase the chances of meeting users’ 
information need and/or mindset. 

Generally, no restrictions are imposed on the semantic similarity measure 
chosen to be included in the current facet generation workflow. However, we 
base our workflow on a structure-based measure that combines the shortest 
path length and the depth. In particular, we consider the one proposed by [23] 
as reference similarity metric between two concepts $c_i$ and $c_j$, defined as follows:


$$sim(c_i, c_j) = e^{-\alpha \cdot length(c_i, c_j)} \cdot \frac{e^{\beta \cdot depth(c_{lcs})} - e^{-\beta \cdot depth(c_{lcs})}}{e^{\beta \cdot depth(c_{lcs})} + e^{-\beta \cdot depth(c_{lcs})}} \tag{6}$$


where $length(c_i, c_j)$ is the shortest path length between $c_i$ and $c_j$, and
$depth(c_{lcs})$ is the shortest path length between the Least Common Subsumer (LCS)
of the two concepts, $c_{lcs}$, and the root concept. $\alpha > 0$ and $\beta > 0$ are used
to adjust the importance assigned to the shortest path length and the depth,
respectively. Based on the correlation evaluation conducted by [23], the optimal
parameters are $\alpha = 0.2$ and $\beta = 0.6$.
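A compact Python sketch of this similarity measure is given below. It uses the fact that the fraction of exponentials equals the hyperbolic tangent of β · depth. The function name is hypothetical, and the path length and LCS depth are assumed to be computed elsewhere, e.g., by traversing the taxonomy.

```python
import math


def concept_similarity(path_length: int, lcs_depth: int, alpha: float = 0.2, beta: float = 0.6) -> float:
    """Similarity from the shortest path length and the depth of the LCS."""
    # (e^x - e^-x) / (e^x + e^-x) is tanh(x).
    return math.exp(-alpha * path_length) * math.tanh(beta * lcs_depth)


# Two concepts that are two hops apart with an LCS at depth five.
print(round(concept_similarity(2, 5), 3))
```

Shorter paths and a deeper common subsumer both increase the similarity, which stays between zero and one.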
The previously defined semantic similarity metric takes a pair of concepts
as input. Therefore, a mapping between properties and concepts needs to be
available. For this purpose, we exploit a particular characteristic of Wikidata’s
data model: Properties are annotated with a matching entity. For example, the
property author (P50) is itself linked to the entity author (Q482980). This
allows us to retrieve entities corresponding to the property path of a facet.

When comparing two facets, we first retrieve the respective entities for the
first property in their property paths. We then calculate the semantic similarity
between the entity pair. Two entities are considered similar if sim is larger than
a defined threshold σ. Since we calculate the similarity over the Wikidata taxonomy,
we only consider links using subclass of (P279) and instance of (P31) here.


4 Workflow 


We consider the facet generation to be part of larger applications. In particu- 
lar, we assume that the retrieval of an initial resource set is subject to other 
independent components. Hence, details of the resource retrieval process are out 
of scope at this point. For the sake of argument, we base our workflow on the 
results of a keyword-based full text search over the string properties of entities 
in the KG. Its result is represented as a set of IRIs, each of which identifies a single
result item or resource. This set forms the input to our proposed facet generation workflow.
We structured the overall process into four phases as shown in Fig. 1. 


[Figure: Phase 1: Candidate generation → Phase 2: Intra-facet scoring and ranking → Phase 3: Selection of better categorization → Phase 4: Inter-facet scoring and filtering]


Fig. 1. Phases of the facet generation process. 


Phase 1: Candidate Generation 

This first phase enumerates possible facets by querying for a list of property
paths associated with the input list of resources. As the predicate probability
$score_{predicateProb}$ is a simple metric, we choose to include it as part of the
query. Candidates that have a $score_{predicateProb}$ below a predefined threshold,
minPredProb, are already removed in this phase. This reduces the necessary
data transfers and the calculation of computationally expensive metrics. The
result is a list of candidates, each comprising a basic graph pattern (BGP) that
describes the facet and a score that reflects the fraction of resources it applies to.
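To make the idea of filtering inside the query concrete, the following Python sketch assembles a SPARQL query that counts, per direct property, the share of input resources using it and drops properties below minPredProb. This is an illustration restricted to paths of length one, not the query issued by the prototype; the function name, variable names, and query layout are assumptions.

```python
from typing import Sequence


def candidate_query(iris: Sequence[str], min_pred_prob: float = 0.1) -> str:
    """Build a SPARQL query listing direct properties with their predicate probability."""
    if not iris:
        raise ValueError("at least one input IRI is required")
    n = len(iris)
    values = " ".join(f"<{iri}>" for iri in iris)
    return (
        f"SELECT ?p (COUNT(DISTINCT ?r) / {n} AS ?predicateProb) WHERE {{\n"
        f"  VALUES ?r {{ {values} }}\n"
        "  ?r ?p ?o .\n"
        "}\n"
        "GROUP BY ?p\n"
        # The threshold is applied by the endpoint, so rejected candidates are never transferred.
        f"HAVING (COUNT(DISTINCT ?r) / {n} >= {min_pred_prob})"
    )


print(candidate_query(["http://www.wikidata.org/entity/Q1", "http://www.wikidata.org/entity/Q2"], 0.5))
```

Longer property paths would require additional triple patterns per hop, which is omitted here for brevity.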


Phase 2: Intra-Facet Scoring and Ranking 
As a prerequisite for the remaining intra-facet metrics, the facet values
along with their respective value sizes are now retrieved from the SPARQL endpoint.
We distinguish between object and data properties⁸ at this point. The latter
are subjected to the clustering described in Subsect. 3.2 to derive comparable
characteristics with regard to intra-facet metrics.

After augmenting the facets with their respective values, the remaining
intra-facet metrics, $score_{dispersion}$ and $score_{valueCard}$, are calculated for all
candidates. This allows us to compute the final intra-facet score, score(f), and
accordingly rank all facets in decreasing order.


Phase 3: Selection of Better Categorization 

The number of necessary inter-facet metrics calculations grows quadratically 
with the remaining number of candidates. To reduce the list of candidates before 
the next step, we exploit a key characteristic of the semantic similarity metric. 
The similarity only depends on the first direct property of each facet. Con- 
sequently, out of all candidates sharing the direct property, only one will be 
chosen for the final result, as all others will be too similar to it. Leveraging this 
observation, we can group the candidates by their direct properties and only 
choose the best-ranked one within each group. 
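The grouping step can be sketched in a few lines of Python. Facets are represented here as pairs of a property path and an intra-facet score; this representation, the function name, and the example scores are hypothetical.

```python
from typing import Dict, List, Tuple

Facet = Tuple[Tuple[str, ...], float]  # (property path, intra-facet score)


def best_per_direct_property(candidates: List[Facet]) -> List[Facet]:
    """Keep only the best-scored candidate among those sharing the first property."""
    best: Dict[str, Facet] = {}
    for facet in candidates:
        path, score = facet
        first = path[0]
        if first not in best or score > best[first][1]:
            best[first] = facet
    # Return the survivors in decreasing order of their score.
    return sorted(best.values(), key=lambda f: f[1], reverse=True)


candidates = [
    (("P495", "P30"), 0.8),  # country of origin / continent
    (("P495",), 0.6),        # country of origin
    (("P57",), 0.7),         # director
]
print(best_per_direct_property(candidates))
```

In this example the indirect facet replaces its direct counterpart because it received the higher score.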


Phase 4: Inter-Facet Scoring and Filtering 
The final result is derived by consecutively applying inter-facet metrics to chosen 
pairs of candidates. Calculating semantic similarities is rather expensive. To 
minimize the comparisons required, facets are selected in a greedy fashion. 

Let C be the list of candidates in decreasing order w.r.t. the intra-facet metric 
scoring of Phase 2 and S be the final collection of facets as returned by Phase 4. 


(i) Initialize S with the best-ranked facet. 

(ii) Take the next facet out of C and compare it with the facets in S. 

(iii) If it is not closely semantically similar to any facet in S, add it to S. 

(iv) Continue with Step (ii) until the desired number of facets is reached or there 
are no more candidates left. 
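Steps (i) to (iv) can be sketched in Python as follows. The similarity function is passed in as a parameter, so any inter-facet metric can be plugged in; the function name and the toy similarity values in the example are hypothetical.

```python
from typing import Callable, List, TypeVar

F = TypeVar("F")


def greedy_select(
    candidates: List[F],
    similarity: Callable[[F, F], float],
    threshold: float,
    max_facets: int,
) -> List[F]:
    """Pick facets in ranking order, skipping those too similar to an already chosen one."""
    selected: List[F] = []
    for facet in candidates:  # candidates are sorted by decreasing intra-facet score
        if len(selected) >= max_facets:
            break
        # The first candidate is always accepted, as there is nothing to compare it with.
        if all(similarity(facet, chosen) <= threshold for chosen in selected):
            selected.append(facet)
    return selected


toy_similarity = {frozenset(("genre", "main subject")): 0.9}


def sim(a: str, b: str) -> float:
    return toy_similarity.get(frozenset((a, b)), 0.1)


print(greedy_select(["genre", "main subject", "director"], sim, threshold=0.7, max_facets=2))
```

Because each candidate is only compared with the facets already selected, the number of similarity computations stays well below the quadratic worst case when few facets are requested.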


Finally, S will contain a subset of facets deemed most suitable for the given 
input set of resources. The suitability has been determined by employing both 
the intra- and inter-facet metrics, which can be extended or changed without 
affecting the corresponding workflow. S can now be presented to users. Note that
selecting a specific value and subsequently reducing the result set will trigger a new
facet generation process, as the basis for our calculations (the input resource
set) might have changed substantially.


5 Evaluation 


The methods described in Sect. 3 were implemented in a prototype that
issues dynamic SPARQL queries to the public SPARQL endpoint of Wikidata
(WDQS)⁹. The source code is available online [2], under an MIT license.


8 Data properties using string literals have already been excluded in the candidate
generation. That means, only numeric and date literals are considered here.
9 https://query.wikidata.org/.
Table 1. Number of candidates depending on path length and number of IRIs.


τ \ IRIs     100    1000    2000    3000    4000
τ = 1         37      52      65      66      75
τ = 2        901    1643    2039    2342    2648
τ = 3      16076   31543   39318   44619   50843


[Figure: stacked average run-times for 500 to 4,000 input IRIs; components: Candidate generation (Phase 1), Facet value retrieval (Phase 2), Semantic similarity (Phase 4), Other]


Fig. 2. Benchmark results: average timings depending on the input IRI size. 


5.1 Benchmarking 


To evaluate the performance of our prototype we used a collection of IRIs 
extracted from Wikidata (instances of novel (Q8261) or its subclasses). 

First, we examined the change in the number of candidates depending on
the path length τ and the number of input IRIs. Results are shown in Table 1. As
expected, the number of candidates increases significantly (about 20-fold) for
each additional hop in the paths. However, a growth in input IRIs yields only a
small effect in comparison. These figures and the considerations of Subsect. 3.1
led to a path length of τ = 2 for the remainder of the evaluation.
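The growth factors behind these statements can be recomputed from the figures in Table 1 with a few lines of Python. The variable names are chosen for illustration only.

```python
# Number of candidates per path length (rows) and number of input IRIs (columns), from Table 1.
candidates = {
    1: [37, 52, 65, 66, 75],
    2: [901, 1643, 2039, 2342, 2648],
    3: [16076, 31543, 39318, 44619, 50843],
}

# Growth per additional hop, for each input size.
hop_growth = {
    tau: [longer / shorter for shorter, longer in zip(candidates[tau - 1], candidates[tau])]
    for tau in (2, 3)
}

# Growth from 100 to 4000 input IRIs, for each path length.
iri_growth = {tau: row[-1] / row[0] for tau, row in candidates.items()}

for tau, factors in hop_growth.items():
    print(tau, [round(f, 1) for f in factors])
print({tau: round(f, 1) for tau, f in iri_growth.items()})
```

Each additional hop multiplies the number of candidates by a factor between roughly 18 and 35, whereas a 40-fold increase in input IRIs raises it only about two- to threefold.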

Subsequently, we looked at the run-time of our prototype for varying sizes of
input IRIs. We fixed the semantic similarity threshold (σ = 0.70), the parameters
for value cardinality scoring (optCard = 10, minCard = 2, and ε = 3), and
the predicate probability threshold (minPredProb = 0.1). Figure 2 shows a
breakdown of the measured execution times, averaged over about 350 individual
measurements over the course of a week. We observe a less than linear growth
of run-time depending on the input IRI size. The most expensive operations are
(1) candidate generation, (2) facet value retrieval, and (3) semantic similarity.
Other operations such as intra-facet metric calculation and selection of better
categorization do not contribute significantly. A detailed analysis revealed that
the execution times are largely dominated by querying the SPARQL endpoint.

Overall, we acknowledge that the current performance prohibits any productive
use. However, the overwhelming impact of query response times on the
overall execution time indicates potential for improvement. Further parallelization
and caching of recurring queries might prove fruitful.


[Figure: mock search interface for the keyword “film” with 200 results (e.g., The Day After Tomorrow, Skyfall, V for Vendetta) and the displayed facets director of photography, production designer, and number of seasons. Question shown: “Suppose that the following facets are already displayed. Which one of the two alternatives would you add to the facets list?” Answer options: genre, camera operator.]


Fig. 3. User evaluation: Fictitious interface for facet selection task.


5.2 User Evaluation 


Setup. In a survey-based user evaluation, we examined whether facets generated 
by the proposed workflow match user expectations. Based on a fictitious scenario, 
we assumed an initial search with the keyword “film”. 

After introducing users to the general concepts of faceted search and the 
given scenario, we asked for user preferences in a series of questions categorized 
into two kinds of situations: one for facet selection and one for facet ranking. 
In facet selection (cf. Fig. 3), users were presented with a static user interface
that resembles a common search engine and includes three different facets, e.g., 
director of photography, production designer, and number of seasons. They were 
then given two more facets, e.g., genre and camera operator, and were asked 
which would be a better addition to the existing three facets. In facet ranking, 
we presented three to four different facets per question and asked users to rate 
their usefulness in the given scenario using a five point Likert scale [24]. 

Unlike facet selection, where only facet headers are shown, facet ranking also 
includes facet values. Unless noted otherwise, all facets and their values are 
modeled according to the data present in Wikidata as of February 2019 using a 
path length of τ = 2. The facets are generated by an initial, prototypical
implementation of the workflow, but were manually adapted to reflect the respective
evaluation intent to emphasize specific intra-facet scores.

Using these situations, the following order of questions was used in the survey.
Overall, we created a pool of 43 questions, out of which a random subset of 15
was chosen for each user. This approach is intended to reduce the bias that might
arise from certain terms used throughout.


Fig. 4. Usage of facets. An option “never” was provided, but not chosen by any user.

In a first set of questions we focus on inter-facet comparisons using facet
selection. In particular, this revolves around the selection of better categorization
(Phase 3 in Sect. 4) and semantic similarity (Subsect. 3.4).

A second set of questions uses facet ranking with facets modeled after Wiki- 
data. This compares multiple indirect facets with their respective direct coun- 
terparts. Here, the indirect facets also vary in their intra-facet scores, allowing 
us to evaluate our strategy in the selection of better categorization. 

Finally, we used facet ranking, this time with abstract facets, i.e., replacing 
facet headers with “Facet 1” etc. and values with “Value 1” etc. The reason is 
again to reduce bias stemming from the actual semantics of the proposed facets. 
In this last part of the evaluation, we issued questions where the proposed facets
differed only with respect to one intra-facet metric¹⁰. In a similar fashion, we
also examined combinations of two and all three proposed intra-facet metrics.

For the survey, we recruited 26 volunteers differing in age (18-44) and edu- 
cational background. In total, they performed 130 facet selections and 936 indi- 
vidual facet ratings. Most of the participants stated at least an occasional use of 
facets, if they are provided (cf. Fig. 4). Consequently, we assume that they are 
familiar with the general behavior of faceted browsing. 


Results. For each question in facet selection, we derive the percentage of
participant selections that match the system decision. Figure 5 shows the results
of the first question set with each dot representing agreement of one particular
question¹¹. For the selection of better categorization we see an overall agreement
between the survey users and our system of ~83%.

The average result for semantic similarity is mixed (~63%). How- 
ever, when analyzing the agreement per question, we see a more polarized 
result. While users most often agree on a specific facet, our system is not 
always able to concur with this choice. This leads us to believe that the 
survey responses were driven more by the applicability of the individual facet 
and not its relation to the already given ones. Yet, this is dependent on the 
available information and hence, out of control of the proposed workflow. 

10 The respective other metrics varied only within a small error margin.
11 By experiment design, not all questions received the same number of responses.


322 L. Feddoul et al.


[Figure 5: share of participant selections matching the system decision, from 0% to 100%, one dot per question; lanes: Semantic similarity, Selection of better categorization]


[Figure: rank correlation from -1 to 1, one dot per question; lanes: Categorization; Card; Disp; Prob; Card, Disp; Card, Prob; Disp, Prob; Card, Disp, Prob]


Fig. 6. Rank correlation for facet ranking tasks. One dot per survey question. Value
Cardinality (Card), Value Dispersion (Disp), Predicate Probability (Prob).


In facet ranking, we are not interested in the specific numerical values each
metric provides, but focus on the ranking induced by those metrics. To compare
the ranking determined by our system with the ranking induced by the survey
responses, we encoded the latter using numerical values and calculated an
average rating for each facet. For each question, we ranked the presented facets
according to these ratings, which results in a survey ranking. We then chose
Kendall’s Tau-B¹² to compare our system ranking with this survey ranking.
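For illustration, Kendall's Tau-B can be computed with the short Python sketch below. It is a straightforward pairwise implementation written for this example, not the code used in the evaluation; the function name and the sample scores and ratings are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's Tau-B rank correlation, which corrects for ties in either ranking."""
    if len(x) != len(y):
        raise ValueError("both rankings must cover the same items")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: contributes to neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denominator = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denominator if denominator else float("nan")


system_scores = [0.9, 0.7, 0.4, 0.2]     # hypothetical system scores for four facets
average_ratings = [4.5, 3.0, 3.0, 1.5]   # hypothetical average survey ratings
print(kendall_tau_b(system_scores, average_ratings))
```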

The survey responses for the second question set, concerned with the selection 
of better categorization, are shown in the topmost lane of Fig. 6. The overall
result shows no clear support for our approach in this step. When there was 
no (obvious) relation between the indirect property and the initial resource set 
(e.g., a facet for country of origin/driving side), users rated the facet rather low. 
However, the system sometimes favors these facets, as they oftentimes provide 
a good categorization with respect to the defined metrics. On other occasions, 
like the facet country of origin/continent, both users and the system agree that 
this is a helpful facet. This leads us to believe that, although indirect facets are 
promising, they require additional refinement to ensure their relevancy. 

The final question set verified our metrics independent of semantic biases 
induced by real-world facets. Results are shown in the lower parts of Fig. 6. In 
general, survey participants agree almost completely with our approach. The 
only exceptions are due to a tie (Card, Disp) or a different opinion about the 
order of one particular pair of facets (Disp, Prob and Card, Disp, Prob). 

The user evaluation suggests that the technical criteria seem well suited in
isolation. However, resulting facets not only have to be evaluated against each
other, but also against the semantic context of the input IRIs. While in search
tasks user input can be used to assess this intent, it remains open how this can
automatically be approximated for arbitrary resource sets.


12 Kendall’s Tau-B is a variant of Kendall’s Tau that also accounts for possible ties in
the ranking. Values range from +1 for identical rankings to −1 for inverse ones. A
value of 0 hints towards no correlation between the involved rankings.


6 Conclusion 


We have proposed methods to enable automatic facet generation and ranking 
over KGs. In particular, we provided an approach for dynamic candidate facet 
generation for arbitrary input sets of resources. We defined intra- and inter-facet 
metrics to rank the candidates and reduce the possible facet space by selecting 
the most useful ones. We explored indirect properties to find better catego- 
rizations and consequently enhance facets’ usefulness. We proposed semantic 
similarity as a criterion to select among multiple candidate facets. Finally, we 
developed a holistic workflow that integrates all proposed methods. 

Initial survey results support the used metrics. While indirect facets show 
promise as a helpful addition, their relevancy for the initial resource set needs 
to be ensured. This latter issue is also the main focus of our future efforts: How 
can we estimate the relatedness to the initial input for indirect facets? Another 
prime direction is a performance improvement of our initial prototype, to make 
it applicable for real-world systems (e.g., caching and parallelization of queries). 
Abstract. SPARQL enables users to access and browse knowledge
graphs in a precise way. However, using SPARQL requires knowledge that
many casual users lack. To counter this, specific tools have been created
that enable more casual users to browse and query results. This paper
evaluates and compares the most prominent techniques, QueryVOWL,
SPARKLIS and the Wikidata Query Service (WQS), through a usability
evaluation, using a mixed-method evaluation based on usability metrics 
and heuristics, containing both quantitative and qualitative data. The 
findings show that while WQS achieved the best results, usability prob- 
lems were encountered in all tools. Key aspects for usability, extracted 
from the evaluation, serve as important contributions for future query 
builders. 


Keywords: Knowledge graphs · SPARQL · Query builder · Usability


1 Introduction 


Linked Open Data (LOD) describes knowledge graphs (KGs) from open sources, 
in such a way that data can be interlinked [13]. These KGs are ubiquitous, 
becoming the de facto standard for heterogeneous data integration [5]. Projects 
such as DBpedia that converts semi-structured content from Wikipedia [2], or 
Wikidata that offers an open KG created by volunteers [23], are just two KG 
examples of many successful stories in the Linked Open Data ecosystem. 

Traditionally, SPARQL [12], the W3C recommended structured query lan- 
guage for graphs, enables users to access and browse these KGs in a precise 
way. However, using SPARQL requires non-negligible knowledge that many
end users do not have [9]. While it is a precise and expressive language, it also 
needs the user to conform to its complex syntax. In addition, it can be diffi- 
cult to manipulate the interconnected graphs or to gain satisfying results from 
queries within them [13]. Therefore, SPARQL is mainly geared towards experi- 
enced users (i.e., semantic web practitioners) with prior knowledge or insights 
regarding the SPARQL query language and the structure of datasets (i.e., the 
underlying data model). To counter this, specific tools have been created that
enable more casual users (i.e., laypeople) to browse and query results. These
so-called SPARQL query builders enable users to generate queries that provide
satisfying results either by suggesting relevant parts of the query or through the
use of graphical metaphors [9]. While many of these query builders are available,
documentation about the evaluation of their usability is scarce. The evaluation
process is often not accurately described and therefore may not lead to meaningful
results [9].

© The Author(s) 2019

This paper aims to fill this gap in the scientific literature by evaluating the 
usability of the most prominent SPARQL query builders. In particular, we focus 
on studying the usability of SPARQL query builders with users who are inexperienced
and have no prior knowledge of SPARQL. For this purpose, we first analyze
existing SPARQL query builders and categorize them based on the querying approach
into form-based, graph-based and natural language-based query builders.
Then, three query builders, the Wikidata Query Service¹ (WQS), QueryVOWL
[11] and SPARKLIS [8], are selected based on factors including their availability,
querying approach and expressiveness. We compare them using a mixed-method
approach consisting of a combination of quantitative and qualitative methods
[10]. We first design three tasks to be performed in each of the three tools,
and we then conduct a user study with 15 individuals. The quantitative data
we gather include the time per task, the completion rate and the number of hints
needed to finish the tasks. Furthermore, we evaluate each tool with a System
Usability Scale (SUS) questionnaire [6]. Qualitative data are collected through
the use of the think-aloud method [7], and the information is analyzed afterwards
by clustering the think-aloud protocols with the usability heuristics [19].

Our results show that the querying approach is not as important for the 
usability of the tool as it may seem, and user satisfaction and preference were 
mostly influenced by the interface design and ease of use of the tools. In particu- 
lar, the form-based WQS approach offered the best usability of the three selected 
tools, although a majority of participants would prefer traditional keyword-based 
search engines over the presented query builders. We expect that our findings 
can serve as initial blueprints to guide the next generation of KG query builders. 

The rest of the paper is structured as follows: Sect. 2 reviews the functionality 
and limitations of current SPARQL query builders. Furthermore, we provide 
details of the three selected tools for the usability evaluation, and explanations 
as to why they were selected. Section 3 shows the design of our mixed-method
approach. Then, in Sect. 4, we analyze the results of our user study and provide
lessons learned, summarizing the insights gathered and making suggestions for
future query builder tools. Finally, Sect. 5 concludes and outlines future work.


2 SPARQL Query Builders 


SPARQL query builders are tools that are specifically designed to facilitate the
process of querying. A range of these tools are available, each with their own
varying purposes, querying approaches and target audiences [9]. In this section,
we first categorize and analyze available tools. Then we select the most prominent
tools for our usability evaluation, based on comparative criteria.


1 https://query.wikidata.org/.


2.1 Categorization 


Query builders can be categorized and differentiated based on many possible 
criteria. In this paper, we follow a purely user interface (UI) based criterion influenced by
Grafkin and Mironov [9], which suggests a categorization on the basis of the querying
approach used in the query builders. Thus, we distinguish between form-based,
graph-based and natural language-based querying approaches. It is important to 
highlight that some approaches combine different elements, following a “hybrid” 
approach [9]. Other existing techniques that can complement these approaches, 
such as AutoSPARQL [17], which uses (supervised) machine learning, are outside 
of the scope of this paper. 


Form-Based Query Builders. Form-based querying is an approach that 
focuses on textual input fields and constructs a query one step at a time. The 
approach resembles SPARQL’s triple pattern syntax and eases the process of 
query building. This is either enabled by suggesting relevant parts of the query 
to users or restricting them to selective inputs. A limitation of this approach 
is that it only allows for a limited set of queries. The classes and objects have 
to be suggested or enabled as a choice for the user to be able to use them in 
the query building process. Furthermore, these tools often do not allow for the 
specification of filters for the results, as they are limited by their input fields. 
Therefore, some tools are not able to formulate advanced queries [17]. 

Examples of form-based approaches are ExConQuer [1], the Linked Data
Query Wizard [14], VizQuery² and the Wikidata Query Service. The ExConQuer
Framework is a set of tools meant to explore, convert, and query linked data. In
the ExConQuer query builder, users can navigate through classes, instances and
properties in a way similar to facet-based browsing. It was deemed useful both
by experts and casual users for exploring and querying linked data [1]. However,
it does not offer the full expressiveness of SPARQL.

The Linked Data Query Wizard [14] (LDQW) was designed as a web-based 
tool for exploring, filtering and analyzing data from SPARQL endpoints. Its 
approach is to turn the underlying graph structure into a tabular interface that 
enables interaction with the data set. This interface is meant to take advantage of 
the fact that many users are already familiar with search engines and spreadsheet 
applications. A user study conducted by the creators of the tool showed that 
it was very usable. However, users had difficulties with adding filters and the 
spreadsheet approach showing too many options [14]. 

The VizQuery tool is based on SPARQL triple patterns and used specifically
for querying data from the Wikidata endpoint. It is a prototype that only offers
the simplest functionality for creating queries. VizQuery uses the Wikidata API
to provide auto-completion and suggestions for Wikidata properties and items.
Although it offers the functionality to use variables for more advanced users, the
UI is very limited in its approach as it does not allow for the creation of more
complex queries.


2 https://tools.wmflabs.org/hay/vizquery/.

Finally, the Wikidata Query Service (WQS) is the official query service of 
Wikidata and offers a “Query Helper” that allows users to create queries through 
a form-based approach. As shown in Fig. 1, users can add items to the filter and 
get relevant suggestions and auto-completion for properties. The Wikidata API 
allows users to create queries even if the exact names for entities are unknown. 
Additionally, if users add an item to the filter, WQS will automatically assign 
a relevant property. However, the WQS does not offer the full expressiveness of 
SPARQL, and it is meant to be domain-specific as it can only reach Wikidata. 


[Figure: WQS Query Helper with the filters “instance of: volcano” and “country: Italy”, the shown property “coordinate location” and a limit of 100, next to the generated SPARQL query]


Fig. 1. Screenshot of WQS query builder
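To illustrate how form rows of this kind map to SPARQL triple patterns, the following Python sketch assembles a query from a list of filters and a list of properties to show. It is a simplified illustration and not the implementation of the WQS Query Helper; the function name and parameters are hypothetical. The prefixes wd:, wdt:, wikibase: and bd: are predefined by the Wikidata endpoint.

```python
from typing import List, Optional, Tuple


def build_query(
    filters: List[Tuple[str, str]],
    show: List[Tuple[str, str]],
    limit: Optional[int] = 100,
) -> str:
    """Turn (property, item) filters and (property, variable) columns into a SPARQL query."""
    variables = ["?item", "?itemLabel"] + [f"?{name}" for _, name in show]
    lines = [f"SELECT {' '.join(variables)} WHERE {{"]
    for prop, item in filters:
        # Each filter row becomes one triple pattern.
        lines.append(f"  ?item wdt:{prop} wd:{item} .")
    for prop, name in show:
        # Shown properties are optional, so items without a value are kept.
        lines.append(f"  OPTIONAL {{ ?item wdt:{prop} ?{name} . }}")
    lines.append('  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }')
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


# Volcanoes (Q8072) in Italy (Q38) with their coordinate location (P625).
print(build_query([("P31", "Q8072"), ("P17", "Q38")], [("P625", "coordinate_location")]))
```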


Graph-Based Query Builders. The graph-based approach consists of visual 
query builders and systems (VQS). This category describes tools that lower 
the difficulty of creating SPARQL queries by enabling a visual approach. The
visualizations used are similar to the syntax of textual SPARQL queries. VQS
support users in creating syntactically valid queries by constraining and guiding
their editing actions through a graphical UI. A limitation of this
approach is that users of these tools still need a rough understanding of the
underlying schema to formulate a query. Without understanding how SPARQL
queries are constructed, users are not able to successfully visualize the queries in
some tools [17].

Examples of VQS include iSPARQL, NITELIGHT [21], OptiqueVQS [22]
and QueryVOWL [11]. iSPARQL and NITELIGHT allow for the whole
expressiveness of SPARQL and are query builders for advanced users. Both tools extend
the traditional SPARQL framework to enhance its functionalities and feature
drag-and-drop interfaces to connect graph nodes and predicates. However, they
use a complex series of buttons and options to incorporate all their features in
the interface. Additionally, because of their complexity, they require thorough
knowledge of the underlying data [21].

OptiqueVQS [22] is primarily meant to be a product for end users with limited 
technical skills and knowledge. Therefore, it includes a simplified interface that 
is meant to enable users to address basic tasks. A criterion that differentiates 
OptiqueVQS from other query builders is that it was developed to meet industrial 
requirements and was evaluated with industrial users. It was, therefore, designed 
to provide a good balance between usability and expressiveness [22], although
it focuses on a very specific user profile.

Finally, QueryVOWL [11] differs from other graph-based tools because it uses
the Visual Notation for OWL Ontologies (VOWL) [18] to visualize the queries.
QueryVOWL enables casual users to build queries by combining VOWL, which has
proven to be intuitive and understandable, with matching SPARQL mappings. It
offers a drag-and-drop graphical UI, where users insert nodes through a
search box and connect them with predicates or other nodes (see Fig. 2). While
QueryVOWL supports most of the expressiveness of SPARQL, it is still somewhat
limited in its node-based approach, with missing functions and bugs that are
inherent characteristics of an initial prototype [11].


Fig. 2. Screenshot of QueryVOWL query builder
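The correspondence between a visual query graph and textual SPARQL can be sketched as follows. This is not the QueryVOWL implementation: `graph_to_sparql` is a hypothetical helper, the graph is modeled simply as a list of edges, and the DBpedia names (`dbo:Scientist`, `dbo:birthPlace`, `dbo:birthDate`, `dbo:award`, `dbr:Vienna`, `dbr:Nobel_Prize_in_Physiology_or_Medicine`) are assumptions about the vocabulary. Prefix declarations are omitted on the assumption that the endpoint predefines them.

```python
# Hypothetical graph model: each edge is (subject, predicate, object).
# The dbo:/dbr: names are assumptions about the DBpedia vocabulary.
EDGES = [
    ("?scientist", "a", "dbo:Scientist"),
    ("?scientist", "dbo:birthPlace", "dbr:Vienna"),
    ("?scientist", "dbo:birthDate", "?birthDate"),
    ("?scientist", "dbo:award", "dbr:Nobel_Prize_in_Physiology_or_Medicine"),
]
# "Born after 1900" is read here as a birth date later than 1900-12-31.
FILTERS = ['?birthDate > "1900-12-31"^^xsd:date']


def graph_to_sparql(edges, filters, select):
    """Serialize graph edges and filters as a SPARQL SELECT query."""
    lines = [f"SELECT DISTINCT {' '.join(select)} WHERE {{"]
    # Each edge of the visual graph becomes one triple pattern.
    lines += [f"  {s} {p} {o} ." for s, p, o in edges]
    # Each filter attached to a node becomes a FILTER clause.
    lines += [f"  FILTER({condition})" for condition in filters]
    lines.append("}")
    return "\n".join(lines)


sparql = graph_to_sparql(EDGES, FILTERS, select=["?scientist", "?birthDate"])
print(sparql)
```

Because every node and edge maps directly to a triple pattern, users of graph-based tools still need to know whether a term such as "scientist" is a class, an individual or a property.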


Natural Language-Based Query Builders. Natural language (NL) 
approaches offer users the convenient and valuable option of using natural lan- 
guage for querying. NL-based approaches enable the user to form precise queries 
by providing the high expressiveness of terms that users are familiar with. As 
NL-based approaches interpret natural language phrases, linguistic considera- 
tions have to be taken into account. Tools of this variety are often limited by 
linguistic ambiguities and variability. Furthermore, the development of accu- 
rate NL interfaces is complex and requires considerable implementation efforts. 
Therefore, NL tools are often domain-specific or tailored to particular applications,
which makes them hard to adapt to other ontologies [15].

Examples of NL-based tools are NLP-Reduce [16] and SPARKLIS [8]. The 
NLP-Reduce tool is meant to facilitate the querying of linked data for users with 
no prior knowledge by using a reduced set of natural language processing
functionalities. The system allows for a non-restrictive query language that can
consist of keywords, sentence fragments or full sentences. NLP-Reduce was deemed
easy to use without any training. However, as the simple approach avoids
complex linguistic and semantic technology, it does not allow for the expressiveness
of SPARQL [15].

SPARKLIS [8] is a natural language-based web tool that offers the full expres- 
siveness of SPARQL and is usable for casual users. As shown in Fig. 3, the query 
in SPARKLIS is represented as a NL sentence in a tree structure and the user 
can focus on different parts of the query to refine it. The selection of parts is 
guided by suggestions that are enabled by SPARKLIS to allow the user to see 
relevant options. If a query element is inserted at the focus, the NL sentence is 
verbalized into a readable form to adapt to that change. Because of its naviga- 
tional approach and the way suggestions are generated, SPARKLIS has some 
problems regarding loading and response times [8]. 
Fig. 3. Screenshot of SPARKLIS query builder


2.2 Selected Tools for Our User Study 


We select representative tools for our study based on the criteria summarized in
Table 1. One of the most important selection factors was the availability
of tools. As many tools, such as ExConQuer or NITELIGHT, were either no longer
available or no longer operational, they could not be evaluated. Web tools
were preferred because casual users could theoretically find them on the web by
themselves and use them to query data. The ease of use or focus on laypeople
was important as a pre-selection criterion⁴, to enable the evaluation with casual
users. We considered a tool to focus on laypeople (i) if it was specified as
designed for casual users in the companion research paper(s) describing the tool,
or (ii) if initial examination tests showed that laypeople could perform simple
tasks with considerable ease.

Thus, our user study finally considers WQS, QueryVOWL and SPARKLIS
as prominent (and available) query builder representatives of form-, graph- and
NL-based query approaches, respectively. Of the three tools, WQS offers
the lowest expressiveness and is furthermore the only tool meant to
be domain-specific, as it can only reach Wikidata. QueryVOWL is an intuitive
graph-based web tool and supports the expressiveness of the SPARQL language
to a certain extent. Finally, of the three selected tools, SPARKLIS offers the
most features of SPARQL and therefore the highest expressiveness, although the
overload of information and options can negatively influence its usability. The
following sections provide an extensive usability evaluation of the three selected
query builders: WQS, QueryVOWL and SPARKLIS.


Table 1. Examined tools with selection criteria and rating (✓ = yes, ~ = partially,
- = no). Selected tools are marked in bold

Query builder | Category    | Availability | Focus on laypeople | Expressiveness
ExConQuer     | form-based  | -            | ✓                  | ~
LDQW          | form-based  | ✓            | ~                  | -
VizQuery      | form-based  | ✓            | ✓                  | -
WQS           | form-based  | ✓            | ✓                  | ~
iSPARQL       | graph-based | ✓            | -                  | ✓
NITELIGHT     | graph-based | -            | -                  | ✓
OptiqueVQS    | graph-based | -            | ✓                  | -
QueryVOWL     | graph-based | ✓            | ✓                  | ~
NLP-Reduce    | NL-based    | -            | ✓                  | -
SPARKLIS      | NL-based    | ✓            | ~                  | ✓


3 Evaluation Design 


Our usability evaluation follows a mixed-method evaluation design, consisting
of both a quantitative and a qualitative part. It is meant to use the results of
one method to clarify the results of the other method and therefore to increase
the meaningfulness of results by capitalizing on the inherent strengths of both
methods. This ensures the comparability of quantitative data while it also allows
for the informative value found in qualitative approaches [10]. We first show the
quantitative and qualitative data considered in our study. Then, we present the
participants, test plan and tasks in our study.

⁴ Note that the final usability is evaluated in the next section; here we just select the
representative tools of each category.


3.1 Quantitative Data 


For the quantitative data, a suitable conceptual model has to offer measures that
can be collected and analyzed in an easy way, while still being meaningful and
usable for the evaluation of query builders. In our evaluation, we chose
the software engineering standard ISO 9241-11 [4]. In particular, we consider
a combination of ISO 9241-11 factors with measurable attributes, as follows.
First, we use the effectiveness factor, i.e., the success in achieving goals. For the
context of the study, this factor is further decomposed into the accuracy (i.e.,
number of hints given to the user during task completion) and completeness (i.e.,
task completion rate) of the tasks performed by the users. Then, we consider the
efficiency, which describes the amount of resources that users spend to achieve
their goals [4] and is measured by the time spent to complete a task. Finally, we
use the satisfaction factor, i.e., the user’s positive attitude towards the tool [4]. In
our evaluation, we make use of the System Usability Scale (SUS) questionnaire⁵
[6] to provide a quantitative measurement of the users’ perceived usability
of the tool, facilitating a comparison.
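As a minimal sketch of how a single SUS questionnaire is turned into a score in the range 0-100, the following function assumes the standard SUS scoring scheme: ten items rated from 1 to 5, where odd-numbered (positively worded) items contribute the rating minus 1, even-numbered (negatively worded) items contribute 5 minus the rating, and the sum is multiplied by 2.5. The function name `sus_score` is chosen for illustration only.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) from ten ratings on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for index, rating in enumerate(responses, start=1):
        if index % 2 == 1:
            total += rating - 1   # odd items are positively worded
        else:
            total += 5 - rating   # even items are negatively worded
    return total * 2.5


# A participant who answers "neutral" (3) everywhere lands in the middle.
print(sus_score([3] * 10))  # 50.0
```

The per-tool score is then the mean of the individual participant scores.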


3.2 Qualitative Data: Think-Aloud 


Qualitative data that are collected in usability evaluations typically consist of 
observational findings about the usability of design features. We make use of 
the think-aloud method [7], i.e., users are asked to voice their thoughts while 
trying to solve a predefined task. Their thoughts are then gathered in the form 
of a think-aloud protocol. This method has proven to be a reliable source of 
information, yet its application offers some challenges. In a realistic scenario, 
some users will have problems with voicing their inner speech. To counter this, 
we include a brief explanation of think-aloud in the pretest introduction and 
prompt participants to voice their thoughts during task completion. The think- 
aloud protocol on its own would be too inaccurate as it is often missing thought 
processes that are not verbalized. Thus, retrospective questioning is used in the 
evaluation, directly after testing the query builders, to ensure that participants
are still able to remember their thoughts. Participants are asked to recall their 
thoughts and opinions on certain points in an unstructured interview. 
Typically, experts would evaluate systems or tools on the basis of so-called
usability heuristics [19] and through that be able to find problems in UIs. We
follow this approach in our evaluation, and analyze the think-aloud protocols based
on the ten usability heuristics by Nielsen [19] that provide general principles for
the design of UIs. The heuristics observed in our evaluation are: visibility of
system status (i.e., keeping users informed of the process), match between system
and real world (i.e., speaking the user’s language), user control and freedom (i.e.,
enabling users to control their workflow and undo and redo actions), consistency
and standards, error prevention, recognition rather than recall (i.e., minimizing
the user’s memory load), aesthetic and minimalist design, and help and
documentation.

⁵ Available at: https://bit.ly/2YuQyHJ.


3.3 Participants 


While an evaluation of systems such as query builders is typically carried out 
by field experts, our test users must have no prior knowledge regarding LOD
or SPARQL, so that the usability for casual users can be analyzed. Thus, our 
selected target group is digital natives that are versed in dealing with web tools 
and interfaces. 

We follow Nielsen’s studies on UI evaluations, which estimate that the optimal
number of participants for a medium-to-large project is 15 users [20].
Thus, 15 bachelor students, ranging from 18 to 25 years old and evenly split
between genders, were finally selected to participate in our usability evaluation.
The participants had never heard of or worked with SPARQL.


3.4 Usability Evaluation 


The evaluation took approximately an hour per person and consisted of three 
parts: (i) the introduction that explained the test procedure to the participant, 
including the topic and evaluation approach of the user study; (ii) the testing of 
the query builders and the corresponding tasks; (iii) a debriefing where partic- 
ipants were able to verify their think-aloud messages and to add something to 
their protocols based on the ability to compare all tools. The order of the three 
tools was randomized before the evaluation. The tasks were printed out on a 
sheet of paper, and read aloud by the participants so that any questions could 
be clarified before users started the evaluation. 

The participants received the following tasks: (i) show all Austrian artists; (ii) 
show all volcanoes in Italy and their location (on a map); (iii) show all scientists 
born in Vienna after 1900 that have been awarded a Nobel Prize in Physiology
or Medicine. The first task introduces the participants to the query builders
and encourages them to interact with the interface and create a list of results.
The second task is similar to the first one in that it combines a single object with
one relation to a subject; however, it introduces the location. As QueryVOWL
does not offer an option to show the location on a map, the task is changed to 
output the location attribute. The third task is the most complex one as it is 
meant to show the expressiveness of the tools. It includes three objects that have 
to be put into relation with one another and a filter based on the birth date. 
For WQS the task was changed to search for humans instead of scientists, to
output the items, as the corresponding results are saved differently in Wikidata.
Furthermore, because the tool is missing a function to filter the results, the WQS
task was changed to additionally output the birth year instead. Hints that were
given to users in the tasks were mostly based on the way the relations or filters
were created. Users were never told where to click but were instead made aware
that they had made an error and that their current approach would not work.
If users were not sure about which relation they had to choose, a keyword was
given to them, which was not counted as a hint. After finishing the tasks for
one tool, the participants filled out the SUS questionnaire and briefly answered the
retrospective questions for the think-aloud protocol.


Fig. 4. Task completion (over a total of 15 participants)


Table 2. Average hints per query builder and task

Query builder | Task 1 | Task 2 | Task 3 | Total hints
WQS           | 1.07   | 0.87   | 0.67   | 2.60
QueryVOWL     | 1.93   | 0.60   | 1.73   | 4.27
SPARKLIS      | 1.80   | 1.20   | 1.53   | 4.53


4 Usability Results 


In the following, we present the results⁶ of the user study divided into the
usability metrics and heuristics that were gathered through the quantitative and
qualitative parts of the user study, respectively. We then provide a discussion assessing
the selected tools.


4.1 Usability Metrics 


The usability metrics are analyzed by comparing the quantitative data of the 
corresponding query builders. 


Effectiveness. The effectiveness comprises the completeness and the accuracy.
Regarding the completeness, most of the participants were able to complete
all the tasks for each query builder, as shown in Fig. 4. In particular, the same
two participants failed at task three for both QueryVOWL and SPARKLIS, and
one further participant failed at the third task for SPARKLIS only.

⁶ All fine-grained results are available at: https://bit.ly/2YuQyHJ.

In turn, Table 2 reports the accuracy results, measured by the average number
of hints (per participant) needed for each task and query builder.
Results show that, on average, participants required the most hints
to finish the first task. This typically corresponds to the familiarization with
the environment. Overall, WQS needed the fewest hints, roughly 40% fewer
than QueryVOWL and SPARKLIS.

Interestingly, while QueryVOWL excels at the second task (i.e., users seemed
to accurately understand and apply similar notions of the first task) it provides 
the worst results for the most complex third task. Finally, it is worth mentioning 
that no participant was able to finish all the tasks of a query builder without at 
least one hint. 

Both completeness and accuracy show that WQS offers the best effectiveness. 
It required the fewest total hints and achieved the best task completion
rate. The effectiveness of QueryVOWL and SPARKLIS showed only marginal 
differences. 
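The completion rates can be reproduced from the failure counts reported above. The sketch below assumes that the rate is computed over all task attempts per tool (15 participants times 3 tasks); the function and variable names are illustrative only.

```python
PARTICIPANTS = 15
TASKS = 3

# Failed task attempts per tool, as described in the text: two participants
# failed task three with QueryVOWL, and three failed it with SPARKLIS.
FAILED_ATTEMPTS = {"WQS": 0, "QueryVOWL": 2, "SPARKLIS": 3}


def completion_rate(failed, participants=PARTICIPANTS, tasks=TASKS):
    """Percentage of task attempts that were completed successfully."""
    attempts = participants * tasks
    return 100.0 * (attempts - failed) / attempts


rates = {tool: round(completion_rate(n), 2) for tool, n in FAILED_ATTEMPTS.items()}
print(rates)  # {'WQS': 100.0, 'QueryVOWL': 95.56, 'SPARKLIS': 93.33}
```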


Efficiency. The efficiency was measured by the amount of time that was spent
to complete a task. The results in Table 3 show that, as expected, participants
needed the most time to finish task three (i.e., the most complex one). In turn,
users on average required the least time to finish the second task, because of
the aforementioned learning effect and its similarity to the first task. Overall,
efficiency results are in line with the accuracy. Thus, users spent the least
time per task with WQS, while the results of QueryVOWL and SPARKLIS are
similar to each other.
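For readers who want to re-derive the totals in Table 3, the following sketch sums the per-task averages given in mm:ss. The helper names are illustrative. Note that the sums of the rounded per-task values come out one second below the reported totals, presumably because the reported totals were computed from unrounded averages (an assumption, as the source does not state this).

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(seconds):
    """Convert a number of seconds to an 'mm:ss' string."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


# Average time per task, taken from Table 3.
TIMES = {
    "WQS": ["01:31", "01:45", "01:52"],
    "QueryVOWL": ["02:42", "01:53", "03:08"],
    "SPARKLIS": ["02:26", "01:48", "03:06"],
}

totals = {tool: to_mmss(sum(to_seconds(t) for t in tasks)) for tool, tasks in TIMES.items()}
print(totals)  # {'WQS': '05:08', 'QueryVOWL': '07:43', 'SPARKLIS': '07:20'}
```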


System Usability Scale. As mentioned in Sect. 3, we measure satisfaction
through the System Usability Scale (SUS) [6], in the range 0-100. The results in
Table 4 show relatively similar scores for each tool. WQS reports the highest SUS
score of 61, which can be interpreted as an ‘OK’ result using the adjective rating
of the scale [3]. In contrast, the similar scores of QueryVOWL and SPARKLIS
can both be rated as ‘Poor’.


Table 3. Average time spent per query builder and task (in mm:ss)

Query builder | Task 1 | Task 2 | Task 3 | Total time
WQS           | 01:31  | 01:45  | 01:52  | 05:09
QueryVOWL     | 02:42  | 01:53  | 03:08  | 07:44
SPARKLIS      | 02:26  | 01:48  | 03:06  | 07:21


Table 4. Average System Usability Scale (SUS) score per query builder

Query builder | SUS score | Rating
WQS           | 61        | ‘OK’
QueryVOWL     | 50.5      | ‘Poor’
SPARKLIS      | 48.5      | ‘Poor’


4.2 Usability Heuristics 


The think-aloud protocols were clustered via the so-called usability heuristics 
[19] (see Sect. 3). We exclude the heuristics that were missing in the protocols
and were not directly observed. In the following, we briefly summarize the results 
that were most frequently mentioned. 

Most QueryVOWL users complained about a lack of visibility of system 
status. The graph-based construction was lacking appropriate feedback in some 
cases and users were not able to see that input had been received. The NL-based
SPARKLIS approach had fewer problems with the visibility, as feedback
was provided instantly in NL form. However, users complained about the loading 
and response times of the tool. 

As for the match between system and real world, some users of WQS 
had problems deciphering the meaning of the used terms. As WQS only offered 
an option to add items as well as properties through the use of the filter but- 
ton, it resulted in some mistakes for the users. The terms that were used in 
QueryVOWL were taken directly from SPARQL and were confusing for some 
of the users. For example, users had problems deciding between using ’artist’
as a class, individual or property, because they had no idea of the underlying
data model. SPARKLIS managed to speak the user’s language through the use 
of non-system-oriented terms and a natural language query. 

Regarding the user control and freedom, most users complained about the
missing undo and redo buttons for WQS and QueryVOWL. SPARKLIS offered
the best user control (e.g., undo and redo) and freedom for participants.

As for consistency and standards, WQS and SPARKLIS offered standard
buttons and users were able to use them easily. In contrast, none of the users
was familiar with the visual style of QueryVOWL.

Regarding error prevention, all tools offered suggestions, enabling users 
to pick from available options. However, WQS was the only tool that offered 
spell checking and suggestions based on the input, a feature that users were 
missing in the other tools. For example, if ’geolocation’ was entered, it still 
showed ’coordinate location’. 

Concerning recognition rather than recall, users described some options 
of WQS as hidden away. Users complained about QueryVOWL’s options, in the
nodes and sidebar, which in some cases were not selectable or visible.

As for the aesthetic and minimalist design, most users complained about 
the clutter of information in parts of the interface of WQS, such as the suggestion 
box. QueryVOWL was described as minimalist to the point of missing necessary
information. Users complained about the amount of options and features that 
were shown at once in SPARKLIS. This led to them feeling overwhelmed and 
not in control at first. 
Table 5. Summary of usability results per tool (bold values show the best tool for
each metric)

Query builder | Completion rate (%) | Average hints | Average time (mm:ss) | SUS score (0-100) | Usability problems
WQS           | 100                 | 2.60          | 05:09                | 61                | 4
QueryVOWL     | 95.56               | 4.27          | 07:44                | 50.5              | 7
SPARKLIS      | 93.33               | 4.53          | 07:21                | 48.5              | 3


Finally, as for help and documentation, users liked the example queries 
of WQS and SPARKLIS, and QueryVOWL’s video. However, they would have
preferred integrated tutorials over the available documentation of the tools. 


4.3 Discussion 


The results of the evaluation are summarized and discussed for each tool. An 
overview of the results, including the number of extracted usability problems, is 
shown in Table 5. 


WQS. WQS had the best results regarding the usability metrics. Users were able 
to achieve the best effectiveness and efficiency by a wide margin. The analysis 
of think-aloud protocols revealed four usability problems: the use of confusing 
terms, missing user control, non-selectable options and the complexity of parts 
of the UI. The problem that most participants encountered was the clutter of 
parts of the UI, such as the suggestion box. Furthermore, buttons and input 
fields were described as hidden away and led to confusion among users. Overall, 
WQS had the best results in this user study, most likely because of the
easy-to-use form-based approach with recommendations and suggestions enabled by the
Wikidata API. However, it offered the lowest expressiveness and was
the only tool that was domain-dependent.


QueryVOWL. In short, the first task of QueryVOWL had the highest average
number of hints of all tasks and tools (Table 2) and the longest average time
of any first task (Table 3). It was apparent that users had difficulties
understanding the visual approach
and the interface of the tool. The think-aloud protocols showed that both
the frustration and difficulties that users experienced could be explained by the 
uncovered usability problems. Seven usability problems could be extracted from 
the protocols: the lack of appropriate feedback, use of confusing terms, miss- 
ing user control, non-familiar visuals, missing error prevention, non-selectable 
options and the UI missing necessary information. One of the most substantial 
problems was the missing visibility of system status and appropriate feedback. 
This, combined with the use of confusing terms and the lack of error prevention, 
led users to make mistakes. The absence of undo and redo functions only
amplified these problems. However, it is important to note that the tested version of
QueryVOWL was a prototype meant to demonstrate the querying approach. 
SPARKLIS. Both the completion rate and the average total number of hints
were marginally worse than those of QueryVOWL. Three usability problems
could be extracted from the think-aloud protocols: the lack of appropriate
feedback, missing error prevention and the overwhelming UI. Most users were
overwhelmed by the number of options and input fields that SPARKLIS offered and
suggested a more minimalistic interface. The SPARKLIS results were interesting in
that the usability metrics differed from the statements and think-aloud
protocols of users. Most users were satisfied with the tool, and those who were
not said that they liked its approach after seeing all tools. This, combined with
the low number of usability problems, speaks for the usability of the tool. The
think-aloud protocols showed that a better tutorial or a more beginner-friendly
interface would have led to a more usable tool.
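The usability problems named in the three discussions above can be tallied to see which ones recur across tools. The sketch below uses only the problem labels listed in the text; the data structure and the harmonized wording of the labels are illustrative choices.

```python
from collections import Counter

# Usability problems per tool, as listed in the discussion of each tool.
PROBLEMS = {
    "WQS": [
        "confusing terms",
        "missing user control",
        "non-selectable options",
        "complexity of parts of the UI",
    ],
    "QueryVOWL": [
        "lack of appropriate feedback",
        "confusing terms",
        "missing user control",
        "non-familiar visuals",
        "missing error prevention",
        "non-selectable options",
        "UI missing necessary information",
    ],
    "SPARKLIS": [
        "lack of appropriate feedback",
        "missing error prevention",
        "overwhelming UI",
    ],
}

# Number of problems per tool (matches the last column of Table 5).
counts = {tool: len(items) for tool, items in PROBLEMS.items()}

# Problems that were observed in more than one tool.
occurrences = Counter(p for items in PROBLEMS.values() for p in items)
shared = sorted(p for p, n in occurrences.items() if n > 1)

print(counts)
print(shared)
```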


4.4 Lessons Learned 


Based on our results and the similarities regarding usability problems and user 
suggestions, we summarize key aspects for designing a query builder for knowl- 
edge graphs. 

First, results show that the querying approach is not as important for the 
usability of the tool as it may seem. User satisfaction and preference were mostly
influenced by the interface design and ease of use of the tools. For example, four
users preferred the graph-based approach of QueryVOWL even if they did not
grade the tool itself as usable.

Regarding the ease of use, the availability of suggestions had a great impact 
on users. Casual users are inexperienced and suggestions allow them to see possi- 
ble queries and subsequently understand the way queries are built. Participants 
suggested that tools could offer their most frequently built queries as examples, 
as (initial) queries built by casual users most likely would not differ too much. 

An important point for the interface design was not to overwhelm the user 
with options. The results of think-aloud protocols show that most participants 
disliked having too many options or fields for query input. At the same time, important
functionalities should not be hidden away from users. The features that are likely
to be used often should be visible and selectable at all times.

Regarding the usability heuristics, some features had a great impact on users. 
The ability to undo and redo parts of the query was praised when it was available 
and criticized when it was missing by a majority of participants. Casual users 
are especially prone to a trial and error method, which is why the possibility to 
undo errors as well as error prevention methods are so valuable. 

Concerning the documentation, participants of the study said that they were
not likely to read tutorials or watch videos longer than a few minutes. A majority
of users suggested integrating tutorials in the query builder interface. The
availability of tooltips could also improve usability.

Finally, it should be noted that for most casual users the alternative to build- 
ing a SPARQL query to gather information is the use of traditional keyword- 
based web search engines. A majority of participants said that if they had the 
choice they would still use those search engines instead of any query builder 
to solve tasks such as those of the user study. Therefore, query builders still
need to somehow compete with this traditional mindset, keeping high usability 
standards while offering advanced functionalities to exploit the expressivity of 
SPARQL and the rich fine-grained information of (potentially interconnected) 
knowledge graphs. 


5 Conclusion and Future Work 


This paper presents a usability evaluation of SPARQL query builders for laypeo- 
ple, i.e., users that want to explore knowledge graphs but have no prior knowl- 
edge of SPARQL. We first categorize and analyze query builders based on their 
querying approach (i.e., form-, graph-, and natural language-based). We then 
select and evaluate three prominent representatives: the Wikidata Query Service
(WQS), QueryVOWL and SPARKLIS.

Our user study is based on a mixed-method usability evaluation with three 
increasingly complex tasks (i.e., queries). On the one hand, we measure the
effectiveness, efficiency and the System Usability Scale (SUS) score as quantitative
data. On the other hand, we make use of the think-aloud method as qualitative 
data, clustering results based on usability heuristics. 

The results show that the form-based WQS offered the best usability of 
the three selected tools. However, usability problems were found for all tools, 
mostly concerning the difficulty of understanding and efficiently performing the 
query building process. Irrespective of the querying approach, users were mostly 
influenced by the interface design and ease of use of the tools. 

Finally, we extract key aspects for the interface design of future query 
builders. These include the availability of undo functions and error prevention 
methods as well as integrated tutorials, examples and suggestions to understand 
how queries are constructed for the underlying knowledge graphs. 

In future work, we plan to expand the user study with a broader spectrum
of queries and users, and to apply the lessons learned to build the next
generation of query builders for knowledge graphs.
Abstract. Question answering engines have become one of the most
popular types of applications driven by Semantic Web technologies.
Consequently, the provision of means to quantify the performance of current
question answering approaches on current datasets has become ever more
important. However, a large percentage of the queries found in popular
question answering benchmarks cannot be executed on current versions
of their reference dataset. There is consequently a clear need to curate
question answering benchmarks periodically. However, the manual
alteration of question answering benchmarks is often error-prone. We
alleviate this problem by presenting QUANT, a novel framework for the
creation and curation of question answering benchmarks. QUANT
supports the curation of benchmarks by generating smart edit suggestions
for question-query pairs and for the corresponding metadata. In addition,
our framework supports the creation of new benchmark entries by
providing predefined quality checks for queries. We evaluate QUANT on 653
questions obtained from QALD-1 to QALD-8 with 10 users. Our results
show that our framework generates reliable suggestions and can reduce
the curation effort for QA benchmarks by up to 91%.


Keywords: Benchmark · Question answering · Knowledge base


1 Introduction 


Question answering (QA) engines are at the core of an increasing number of 
human computer interfaces, including personal assistants and chatbots [9]. The 
development of accurate QA frameworks for (RDF) knowledge graphs has hence 
become an endeavor of increasing importance and popularity [14,16]. Conse- 
quently, the provision of means to evaluate the performance of QA systems on 
current datasets is critical to (1) monitor the improvement of the state of art over 
past approaches and (2) provide realistic insights in relevant improvements for 
question answering systems on current challenges found in datasets. Benchmark 
series such as the Question Answering on Linked Data (QALD) series [15] address 
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this need for objective evaluation. They support QA researchers and developers 
by providing new versions of their benchmarks periodically. However, maintain- 
ing high-quality and current benchmark datasets is a challenging endeavor. In 
particular, changes in the knowledge base underlying the benchmarks (as well 
as metadata annotation errors) lead to a large proportion of the queries in pre- 
vious benchmarks not being executable on current versions of datasets. Table 1 
gives an overview of the extend of the degradation of the QALD benchmarks 
over time. A significant proportion of the SPARQL queries that were not modi- 
fied over time degraded (i.e., could not be executed) with newer versions of the 
knowledge base underlying QALD. For example, more than 30% of the QALD-4 
benchmark cannot be executed on DBpedia 2014, which was released a mere year 
after the publication of QALD-4. 


Table 1. Degradation of QALD benchmarks against various versions of DBpedia (in 
%). The numbers in brackets indicate total number of questions. 


DBpedia version | QALD-1 (44) | QALD-2 (87) | QALD-3 (88) | QALD-4 (177) | QALD-5 (262) | QALD-6 (350) | QALD-7 (215) | QALD-8 (219)
3.6             | 18.18       |             |             |              |              |              |              |
3.7             | 25.00       | 16.09       |             |              |              |              |              |
3.8             | 31.82       | 20.69       | 17.05       |              |              |              |              |
3.9             | 54.55       | 41.38       | 40.90       | 25.99        |              |              |              |
2014            | 50.00       | 39.08       | 40.90       | 30.50        | 24.43        |              |              |
2015-04         | 36.36       | 27.58       | 23.86       | 18.08        | 13.74        |              |              |
2015-10         | 36.36       | 26.44       | 23.86       | 18.08        | 12.59        | 10.57        |              |
2016-04         | 36.36       | 26.44       | 25.00       | 20.90        | 14.88        | 14.00        | 4.19         |
2016-10         | 43.18       | 33.33       | 32.95       | 25.99        | 20.23        | 20.00        | 12.09        | 0


Addressing the challenge of updating a QA benchmark to the current schema 
of a dataset is a tedious, time-consuming and error-prone endeavor (see Sect. 4 
for numbers). In this paper, we alleviate this problem by providing QUANT, a 
framework for the intelligent creation and curation of QA benchmarks. QUANT 
regards the $i$-th version $B_i$ of a QA benchmark as a pair $(D_i, Q_i)$ composed of 
a dataset $D_i$ and a set of questions $Q_i$. One of the core functions of QUANT 
is the generation of intelligent suggestions for benchmark curators (i.e., users 
annotating and improving a QA benchmark): Given a query $q_{ij} \in Q_i$ with zero 
results on $D_k$ with $k > i$ (i.e., on a newer version of $D_i$), QUANT’s suggestions 
aim to provide a small number of modifications to $q_{ij}$, such that the modified 
query $q'_{ij}$ can be executed on $D_k$ with non-zero results. We call this modification 
process porting the queries from version $i$ to version $k$. With 
these smart suggestions, QUANT aims (1) to ensure that queries from $B_i$ can 
be reused for $B_k$ (e.g., as training queries) and (2) to speed up the curation 
process as compared to the commonly used manual and text-editor-based cre- 
ation and curation process [15]. To achieve this goal, QUANT (1) supports the 
creation of SPARQL queries answering a particular information need as well as 
the execution of said query against a predefined endpoint or knowledge base. 
Moreover, QUANT checks (2) the validity of benchmark metadata as well as (3) 
the spelling and grammatical correctness of questions across multiple languages 
both in their natural-language query and keyword form. 

To demonstrate the usability of QUANT and the efficiency of the smart sug- 
gestions, we performed two extensive evaluation campaigns. First, we analyzed 
the performance gain using QUANT over the traditional manual curation process 
with 3 experts. The results show that we decreased the required curation time by 
91% while keeping the inter-rater agreement at 0.82. Second, we used QUANT 
to create a new joint benchmark from 8 QALD datasets. The smart suggestions 
were accepted by 83.75% of the users on average, indicating their usefulness. 
The novel, large and high-quality QA benchmark dataset, called QALD-9, is 
available at https://github.com/ag-sc/QALD/tree/master/9/data. 


2 Related Work 


The work on QUANT is related to three research areas, namely (1) workshops 
and evaluation campaigns, (2) datasets for QA over knowledge graphs and (3) 
curation tools for benchmarks. 


2.1 Workshops and Evaluation Campaigns 


A number of challenges and campaigns attracting researchers as well as indus- 
try practitioners to QA have seen the light of day over the last two decades. 
Since 1998, the TREC conference, especially the QA track [17], aims to provide 
domain-independent evaluations over large, unstructured corpora. The CLEF 
campaigns on information retrieval have a more than 10-year tradition in 
evaluating IR systems [1]. The well-known QALD (Question Answering over Linked 
Data) [15] campaign, currently running in its 9th instantiation, is a diverse 
evaluation series which includes questions whose answers can be computed 
(1) based on a single RDF knowledge base, (2) by combining RDF and textual 
data, or (3) using several knowledge bases. The benchmarks cover several domains, 
including encyclopedic knowledge and music. Given that this series of 
benchmarks is openly available and widely used ([5,7] point to 30 systems that were 
evaluated using QALD), we will use the QALD datasets to evaluate QUANT. 


2.2 QA over Knowledge Graphs 


Other QA datasets emerged apart from the above-mentioned challenges. 
LCQuAD [13] is one of the largest QA over knowledge bases benchmarks with 
5000 questions and their corresponding SPARQL queries over the 2016-04 version 
of DBpedia.¹ It also provides a framework for generating natural language 
questions and their corresponding SPARQL queries, minimizing the domain expert 
intervention. However, these questions are often grammatically incorrect and 


1 http://dbpedia.org. 
require manual paraphrasing. Out of the 5000 LCQuAD SPARQL queries, 2570 
queries could not be answered by the 2016-10 DBpedia version and, 
interestingly, 456 queries were not answered by the 2016-04 version. We performed this 
evaluation before LCQuAD was updated.² Free917 [3] and WebQuestions³ are 
widely used in the Semantic Web as well as the Deep Learning community. Cai and 
Yates [3] manually created the Free917 dataset consisting of 917 questions and 
their logical forms, tailored to around 600 Freebase properties. Berant et al. [2] 
generated the WebQuestions dataset by using the Google Suggest API to collect 
1M questions and had a subset of them (100K) labeled on Freebase by 
Amazon Mechanical Turk workers. Yih et al. [18] built the WebQuestionsSP dataset⁴ 
by re-annotating the WebQuestions dataset. WebQuestionsSP, unlike its parent 
dataset, contains the natural language questions, their semantic parses in the 
form of SPARQL queries and the derived answers. For annotating the dataset 
with SPARQL queries, they designed a dialog-like user interface to speed up the 
process, which is unfortunately no longer available. 


2.3 Curation Tools for Benchmarks 


Since both manual curation and crowd-sourcing for benchmark creation are 
tedious and time-consuming tasks, there is a need for tools that speed up the 
process while reducing annotation errors. Jha et al. [8] built Eaglet, a semi- 
automatic benchmark curation tool for named entity recognition and entity 
linking (NER/EL). The framework checks for anomalies in a gold standard, 
based on the rules derived from the existing gold standards for annotating doc- 
uments for NER/EL. Duan et al. [6] introduced an RDF storage benchmark 
generator to convert any dataset into a benchmark dataset (to reduce the gap 
between real and benchmark RDF data) for evaluating the performance of RDF 
stores, by formulating the benchmark generation problem as an integer program- 
ming problem. It is capable of generating data that resembles the characteris- 
tics (structuredness, size, and content) of real datasets with user-specified data 
properties. Lance [11] is a domain-independent, generic benchmark generator 
for Instance Matching systems; it supports semantics-aware transformations with 
varying degrees of difficulty and creates a weighted gold standard for a better 
evaluation of the performance of instance matching tools. The interface accepts 
user-provided specifications to generate a benchmark. All the above-mentioned 
tools are similar to QUANT in the sense that they make benchmark generation 
easier for end users and employ strategies derived from an analysis of previous 
gold standards to improve the quality of the resulting dataset. However, to the 
best of our knowledge, there is no tool similar to QUANT in the domain of QA 
over knowledge bases. 


2 http://lc-quad.sda.tech/. 
3 https://goo.gl/93iqgC. 
4 http://aka.ms/WebQSP. 
3 Approach 


3.1 Architecture and Workflow 


QUANT has a modular design comprising (1) a preprocessing module to elim- 
inate duplicates in case several datasets get loaded, (2) a machine translation 
module to automatically translate text into 10 languages, (3) a keyword gener- 
ation module to make the QA datasets suitable even for keyword-based infor- 
mation retrieval evaluations, and finally (4) a curation module to serve smart 
suggestions as can be seen in Fig. 1. 


[Figure 1 diagram: modules Preprocessing, Machine Translation, Keywords Generation and Curation, turning an input dataset into a curated dataset; curated attributes: SPARQL, Metadata, Multilingual Keywords, Multilingual Question] 

Fig. 1. QUANT’s modular architecture 


A curation process consists of (a) a user logging into QUANT, (b) determining 
an endpoint URI and version for the target knowledge base, and (c) uploading a 
(QALD-JSON formatted) dataset. A user can then either (d) create, (e) delete 
or (f) curate questions using smart suggestions. Finally, a user can export the 
curated dataset into the widely accepted QALD-JSON format [16]. Examples of 
a dataset formatted using QALD-JSON are available online.⁵ 
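
To make the upload and export format more concrete, the following is a minimal sketch of one benchmark entry serialized as JSON. The field names mirror the metadata attributes discussed in this paper (answer type, aggregation, onlydbo, hybrid), but the exact schema, the helper `make_entry` and the example question are assumptions made for illustration and should be checked against the published QALD files.

```python
import json

# Hypothetical helper: builds one QALD-JSON-style benchmark entry.
# The field names are assumptions modeled on the metadata named in the text.
def make_entry(qid, question_en, keywords_en, sparql):
    return {
        "id": str(qid),
        "answertype": "resource",
        "aggregation": False,
        "onlydbo": True,
        "hybrid": False,
        "question": [
            {"language": "en", "string": question_en, "keywords": keywords_en}
        ],
        "query": {"sparql": sparql},
    }

dataset = {
    "questions": [
        make_entry(
            1,
            "In which country is New Delhi?",  # illustrative question
            "New Delhi, country",
            "PREFIX dbo: <http://dbpedia.org/ontology/> "
            "PREFIX res: <http://dbpedia.org/resource/> "
            "select ?s where { res:New_Delhi dbo:country ?s . }",
        )
    ]
}

# Serialize the dataset as it might be exported after curation.
serialized = json.dumps(dataset, indent=2)
```

Keeping one language-tagged object per translation inside `question` is what would allow the multilingual question and keyword suggestions to be stored next to the query.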


3.2 Smart Suggestions 


The most distinctive features of QUANT which enhance the overall curation 
productivity are smart suggestions for every attribute (SPARQL query or meta- 
data) in case they contain a wrong value. That is, the system automatically 
detects the presence of potentially incorrect entries and offers hints pertaining 
to how to correct them. In the subsequent paragraphs, we explain how we pro- 
vide smart suggestions for (a) question to SPARQL mappings as well as for (b) 
metadata attributes, and (c) question or keyword translations. 


5 https://github.com/ag-sc/QALD/tree/master/9/data. 


348 R. H. Gusmita et al. 


SPARQL Suggestion. There can be various reasons for a SPARQL query (S) 
that worked on a previous version of the knowledge base to not work against 
another version or another endpoint. If QUANT is not able to fetch results 
from the current SPARQL endpoint, it activates the SPARQL Correction and 
Suggestion curation module. QUANT either suggests a new SPARQL query or 
renders the failure case if the correction module fails, to allow for a manual 
curation by the user. The cases that we handle for SPARQL correction are 
described below. 


— Missing prefixes: QUANT first checks whether the query (S) fails due to 
missing prefixes. We call a prefix missing if it has been used in the query but 
not been defined in the beginning. An example of such a query is, 


select ?s where { res:New_Delhi dbo:country ?s .} 


where the corrected query is: 


PREFIX dbo: <http://dbpedia.org/ontology/> 
PREFIX res: <http://dbpedia.org/resource/> 
select ?s where { res:New_Delhi dbo:country ?s .} 


Henceforth, we assume that a query contains the correct prefixes. 

— Predicate change: For every triple in the query with either a known subject 
or object (pre:knownEntity in the example below), we check if the predicate 
changed in the underlying knowledge base. The original predicate is preserved 
and a new SPARQL query (S’) is formed to search for all the predicates that 
are associated with either the known subject or the known object. However, if 
in a SPARQL triple, both subject and object are unknown, then all previous 
triples (if present) (prevSubject prevPredicate prevObject.) are added 
to the new SPARQL query (S’). That is, if the triple being checked is of 
the form ?s pre:Predicate ?o and there exists a triple preceding it that 
provides the value for either the unknown subject (?s) or the unknown object 
(?o), then in the new SPARQL query (S’), the previous triple is used to limit 
the search space for predicate testing. The new SPARQL query (S’) can have 
one of the following forms. 

First form: 


select ?p where { pre:knownEntity ?p ?o } 


Second Form: 


select ?p where { ?s ?p pre:knownEntity } 


Third Form: 


select ?p where { ?unknownSubj prevPredicate prevObject. ?unknownSubj ?p ?o. } or 

select ?p where { prevSubject prevPredicate ?unknownObj. ?s ?p ?unknownObj. } or 

select ?p where { ?unknownSubj prevPredicate prevObject. ?s ?p ?unknownSubj. } or 

select ?p where { prevSubject prevPredicate ?unknownObj. ?unknownObj ?p ?o. } 


All the resulting predicates that match or contain the original predicate’s 
label are stored. By replacing the original predicate with each of these stored 
predicates in the original SPARQL query, we check if the query produces 
non-zero results. If it works, we suggest this newly formed query to the user. 
Note, if we need to apply the third case, it is a match against all predicates 
in the knowledge base that arise from the result of the previous triple(s). 
— Predicate Missing: If none of the resulting predicates matches or contains 
the original predicate’s label, the user is informed about the missing predicate 
in the triple for manual curation. 

— Entity Change: Each known entity (subject or object) in the triple is checked 
in the knowledge base. In DBpedia, if the entity is not found and belongs 
to a YAGO class, we append ‘Wikicat’ (which is a YAGO-specific update 
on DBpedia’s later versions) at the beginning of the entity label and check 
again. If this new YAGO class is present in the knowledge base, we check 
if the SPARQL query works with it and suggest it to the user, if it does. If 
the missing entity is not a YAGO class, we check if there is a redirection on 
DBpedia for this entity by using the following SPARQL query: 


select ?redirect where 
{ <entityToBeChecked> dbo:wikiPageRedirects ?redirect. } 


If a redirect is found, the updated SPARQL query is tested against the 
endpoint and suggested to the user if it returns an answer. Note that we are aware 
that this method is highly tailored towards DBpedia, but it can be adapted 
to any Linked Data knowledge graph using standard attributes such as 
owl:sameAs and skos:related. 

— Entity Missing: The user is informed about the missing entity in the triple if 
the procedures to find an alternative entity fail. 
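
The first check in the list above, the detection of missing prefixes, can be illustrated with a minimal sketch. It assumes a small table of known prefixes and a regular-expression scan; the function name `add_missing_prefixes` and the table `KNOWN_PREFIXES` are hypothetical, and this is not necessarily how QUANT implements the check.

```python
import re

# Assumed prefix table; a real deployment would load this from configuration.
KNOWN_PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "res": "http://dbpedia.org/resource/",
}

def add_missing_prefixes(query, known=KNOWN_PREFIXES):
    """Prepend PREFIX declarations for prefixes that are used but not declared."""
    declared = set(
        re.findall(r"PREFIX\s+([A-Za-z][\w-]*):", query, flags=re.IGNORECASE)
    )
    # Blank out full IRIs and string literals so "http:" is not read as a prefix.
    body = re.sub(r'<[^>]*>|"[^"]*"', " ", query)
    # A prefix is a name followed by ':' and a local name, not part of a variable.
    used = set(re.findall(r"(?<![\w?$])([A-Za-z][\w-]*):[A-Za-z_]", body))
    missing = sorted(p for p in used - declared if p in known)
    header = "".join(f"PREFIX {p}: <{known[p]}>\n" for p in missing)
    return header + query

fixed = add_missing_prefixes("select ?s where { res:New_Delhi dbo:country ?s .}")
```

Applied to the example query from the list, the sketch prepends the `dbo` and `res` declarations and leaves an already complete query unchanged. Prefixes that are used but absent from the table are left alone, which corresponds to handing the query back for manual curation.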


If there are no suggestions generated after performing the checks above, 
QUANT permutes the order of the triple patterns within the conjunctive clauses 
to which they belong and reruns the SPARQL correction pipeline. While the 
order of the triple patterns in such clauses does not matter, it does affect the 
search space when we test for entities and predicates. Hence, by changing the 
order of triples, we can either narrow down or broaden the search space and 
increase the probability of correcting the SPARQL query and returning a sug- 
gestion. 
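
The permutation step can be sketched as follows. This is a minimal illustration that assumes the triple patterns of one conjunctive clause are available as strings; the function `permuted_queries` is hypothetical, and the subsequent rerun of the correction pipeline on each candidate is not shown.

```python
from itertools import permutations

def permuted_queries(select_clause, triple_patterns):
    """Yield one query string per ordering of the given triple patterns."""
    for ordering in permutations(triple_patterns):
        yield f"{select_clause} WHERE {{ {' . '.join(ordering)} . }}"

patterns = [
    "?website rdf:type onto:Software",
    "?website onto:releaseDate ?date",
    '?website rdfs:label "DBpedia"',
]

# Each candidate would be passed through the correction checks again.
candidates = list(permuted_queries("SELECT ?date", patterns))
```

For n triple patterns this produces n! orderings, so an implementation would likely stop at the first ordering that yields a suggestion.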


The following examples depict SPARQL suggestions or messages returned by 
QUANT when it receives an outdated query. 


— Entity change: 


Degraded SPARQL query: 
SELECT ?uri WHERE 
{ ?uri rdf:type yago:CapitalsInEurope } 


QUANT suggestion: 


SELECT ?uri WHERE 
{ ?uri rdf:type yago:WikicatCapitalsInEurope } 


— Predicate missing: 
Degraded SPARQL query: 


SELECT ?uri WHERE 
{ ?subject rdfs:label "Tom Hanks". 
?subject foaf:homepage ?uri } 


QUANT suggestion: 


The predicate foaf:homepage is missing in ?subject foaf:homepage ?uri 


— Predicate change, Query Permutation: 
Degraded SPARQL query: 


SELECT ?date WHERE 

{ ?website rdf:type onto:Software . 
?website onto:releaseDate ?date . 
?website rdfs:label "DBpedia" . } 


QUANT suggestion: 


SELECT ?date WHERE 

{ ?website rdf:type onto:Software . 
?website rdfs:label "DBpedia" . 
?website dbp:latestReleaseDate ?date . } 
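
The predicate-change example can be related to the label matching described earlier with a minimal sketch. It assumes that the local name of a predicate serves as its label and that matching is case-insensitive; both are assumptions, and the helpers `local_name` and `candidate_predicates` are hypothetical.

```python
def local_name(iri_or_curie):
    """Return the part after the last '#', '/' or ':' as a rough label."""
    for sep in ("#", "/", ":"):
        iri_or_curie = iri_or_curie.rsplit(sep, 1)[-1]
    return iri_or_curie

def candidate_predicates(original, found):
    """Keep predicates whose label equals or contains the original label."""
    needle = local_name(original).lower()
    return [p for p in found if needle in local_name(p).lower()]

# Predicates assumed to be returned by the search query for ?website.
found = ["rdf:type", "rdfs:label", "dbp:latestReleaseDate", "dbp:developer"]
matches = candidate_predicates("onto:releaseDate", found)
```

Each surviving candidate would then replace the original predicate in the query, and only candidates for which the query returns results would be suggested to the curator.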


Metadata Suggestion. QA benchmark metadata can be used to tailor bench- 
marks to the needs or research directions that a QA system follows, e.g., to 
ignore questions which need aggregation operations or to especially focus on 
them [5,7]. QUANT provides formal checks for the metadata entries found in 
QA benchmarks. 

For example, the answer type tag corresponds to the data type of the answer 
returned by the SPARQL endpoint. There are five possible data types (i.e., 
Boolean, Date, Number, Resource, and String). If the existing value of answer 
type is not suitable for the returned answer, QUANT will suggest the correct 
one based on a regular expression. The aggregation tag defines whether the 
SPARQL query contains one or more aggregation functions such as COUNT, 
SUM, AVG, MIN, MAX, SAMPLE, GROUP_CONCAT, VECTOR_AGG, and 
COUNT DISTINCT.⁶ If the SPARQL query contains at least one of these 
functions, aggregation must be set to true; otherwise, it is set to false. QUANT detects 
the presence of these functions in the query and suggests the correct value. The 
hybrid metadata entry describes whether it is required to search not only the 
Linked Data knowledge base but also textual data to produce an answer. The 
SPARQL query of a hybrid question will mostly contain the phrase text:query 
or bif:contains. In this case, this attribute must be set to true. onlydbo 
is a binary flag which states whether the SPARQL query contains URIs which 
belong exclusively to the DBpedia namespace. QUANT examines all the URIs, 
both long forms and abbreviations,⁷ in the SPARQL query to check if they 
belong to DBpedia and suggests the correct value for this field. The out-of-scope tag 
denotes a SPARQL query that is not able to retrieve answers from a SPARQL 
endpoint or whose answers are not semantically correct. If this is the case, 
out-of-scope must be set to true. 
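
Two of these metadata checks lend themselves to a short illustration. The sketch below assumes simple regular expressions for the aggregation flag and the answer type; the concrete patterns and the function names `suggest_aggregation` and `suggest_answer_type` are assumptions, since the paper does not list the expressions QUANT uses.

```python
import re

AGGREGATES = ("COUNT", "SUM", "AVG", "MIN", "MAX", "SAMPLE",
              "GROUP_CONCAT", "VECTOR_AGG")
# An aggregate is assumed to appear as a function call, e.g. COUNT(DISTINCT ?x).
_AGG_RE = re.compile(r"\b(" + "|".join(AGGREGATES) + r")\s*\(", re.IGNORECASE)

def suggest_aggregation(sparql):
    """Suggest the value of the aggregation flag for a SPARQL query."""
    return bool(_AGG_RE.search(sparql))

def suggest_answer_type(value):
    """Suggest one of the five answer types for a single returned value."""
    v = value.strip()
    if v.lower() in ("true", "false"):
        return "boolean"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
        return "date"
    if re.fullmatch(r"[-+]?\d+(\.\d+)?", v):
        return "number"
    if re.fullmatch(r"https?://\S+", v):
        return "resource"
    return "string"
```

A suggestion would only be shown when the computed value differs from the value currently stored in the benchmark entry.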


Multilingual Questions and Keywords Suggestion. To enable multilin- 
gual QA evaluation campaigns and foster more active research in this area, QA 
benchmarks are often made available in several languages. However, translating 
queries across languages in a consistent way entails a significant amount of man- 
ual effort. In the case of missing or incomplete translations, QUANT first applies 
stopwords and question-word removal techniques to generate missing keywords. 
Here, we rely on techniques similar to those implemented in FOX [12]. Secondly, 
our framework applies an automated machine translation tool called Translate 
Shell⁸ to provide translation suggestions in 10 other languages for both 
questions and keywords. As machine translation is not perfect, the completion of the 
final translation remains the curator’s task. 
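
The keyword generation step can be sketched as follows. The word lists are small illustrative assumptions rather than the lists used by QUANT or FOX, and the function `generate_keywords` is hypothetical.

```python
import re

# Small illustrative word lists; these are assumptions, not QUANT's lists.
QUESTION_WORDS = {"who", "whom", "whose", "what", "which", "when",
                  "where", "why", "how"}
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in",
             "on", "did", "does", "do", "by", "to"}

def generate_keywords(question):
    """Drop question words and stopwords and join the remaining tokens."""
    removable = QUESTION_WORDS | STOPWORDS
    tokens = re.findall(r"\w+", question)
    return ", ".join(t for t in tokens if t.lower() not in removable)

keywords = generate_keywords("What is the capital of Germany?")
```

The resulting keyword string would then be offered to the curator as a suggestion and, like the question itself, passed on to machine translation.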

Figure 2 shows a screenshot of the framework during the curation process. 


4 Evaluation 


Our evaluation had three goals: (1) compare the curation time using QUANT 
with manual curation time, (2) investigate the effectiveness of smart suggestions, 
and (3) determine how capable QUANT is in providing a high-quality benchmark 
dataset. 


4.1 Efficiency Evaluation 


First, we analyzed the performance gain using QUANT versus a manual curation. 
Our annotators were three graduate CS students with a good working knowl- 
edge of Linked Data. To avoid any inherent bias, the three graduate students 


6 https://www.w3.org/TR/sparql11-query/#Aggregates. 
7 https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#sparqlSyntax. 
8 https://www.soimort.org/translate-shell/. 


Fig. 2. Screenshot of QUANT’s curation process 


worked sequentially and without any prior knowledge of the data. They had 
to curate 50 questions manually and subsequently curate 50 different questions 
using QUANT. The results show that QUANT decreased the required time by 91% while 
keeping the inter-rater agreement between two of the users at 0.82, which stands 
for almost perfect agreement [4]. On average, users needed 23 min (between 22 
and 25 min) using QUANT as opposed to 278 min (between 240 and 330 min) 
using a manual curation approach, i.e., manual curation took more than 10x longer. 
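
The reported reduction can be checked directly from the two averages. The following lines are a plain arithmetic check using only the numbers stated above.

```python
manual_min = 278  # average manual curation time in minutes, as reported
quant_min = 23    # average curation time with QUANT in minutes, as reported

# Relative time saved and speed-up factor.
reduction = 1 - quant_min / manual_min
speedup = manual_min / quant_min

print(f"reduction: {reduction:.1%}, speed-up: {speedup:.1f}x")
```

The result, a reduction of about 91.7% and a speed-up of about 12x, is consistent with the stated 91% and with the claim of more than 10x.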


4.2 Smart Suggestion Evaluation 


Second, we used QUANT to create a new joint benchmark by joining the past 8 
QALD datasets together and unifying them. We divided 10 expert users (Ph.D. 
students and senior researchers) into 5 pairs. The members of each pair had to 
curate exactly the same questions. The first four user pairs each curated 130 
questions, while the fifth user pair worked on another 133 questions. This resulted in 653 
questions (4 × 130 + 133); see also Sect. 5. 

We monitored the number and types of suggestions accepted by the users 
throughout the curation process. Our evaluation results show that, of the 2380 
suggestions provided by QUANT in total, the acceptance rate across all users 
was 81.04% on average (see Fig. 3). As seen in Fig. 4, most users accepted 
suggestions for the out-of-scope metadata, which, after correcting the SPARQL query, 
entailed a change to the questions’ metadata.⁹ Keyword and question transla- 
tion suggestions yielded the second and third highest acceptance rates. We got 
higher acceptance rates (over 90%) mostly on questions from the later versions 
of QALD (QALD-7 and QALD-8), i.e., by users 1, 2, 9, and 10. Despite the fact 
that the questions are handed out to the curators in chronological order, we saw 
no effect of this ordering in the acceptance rate. Interestingly, the acceptance 


9 Note that if a user changed the SPARQL query manually using the hint from the 
suggestion, it is not added to the statistic. 
rate is independent of the number of suggestions. 83.75% of the users accepted 
QUANT’s smart suggestions on average, see Fig. 5. However, the hybrid meta- 
data attribute and the SPARQL suggestions were only accepted by 2 and 5 
users respectively. We were also interested to know how many attributes were 
changed without using smart suggestions and redefined by users directly. Dur- 
ing the evaluation with 10 users there were 4 attributes changed without using 
the suggestions, see Fig. 6. These are answer type, onlydbo, out-of-scope, and 
SPARQL query. 
Fig. 6. Number of attributes whose values were provided by users 


Finally, we computed the inter-rater agreement between each pair of users 
which shared the same questions. Our results are shown in Table 2 and indicate 
very good to almost perfect agreement among the users [4]. This is a very 
positive result as it suggests that our framework provides consistently helpful 
suggestions to its users. 


Table 2. Inter-rater agreement over 5 annotator pairs curating at least 130 questions 


Group         | Inter-rater agreement 
1st two-users | 0.97 
2nd two-users | 0.72 
3rd two-users | 0.88 
4th two-users | 0.77 
5th two-users | 0.96 
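
The text does not name the agreement statistic. Assuming that it is Cohen's kappa over the decisions of two curators on the same items, it could be computed as in the following sketch; the function `cohens_kappa` and the toy labels are illustrative only and do not reproduce the values in Table 2.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined when expected == 1, i.e. both raters always use one label.
    return (observed - expected) / (1 - expected)

rater_1 = ["accept", "accept", "reject", "accept"]
rater_2 = ["accept", "accept", "reject", "reject"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.5, which gives a kappa of 0.5.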


5 QALD-Specific Analysis 


In total, there are 1924 questions where 1442 questions are training data and 482 
questions are test data across the different versions of QALD we considered. So 
far, novel QALD train datasets were created by merging the test and training 
questions of the previous QALD version. The test dataset for a new QALD 
version is normally based on completely new questions extracted from search 
engine or chatbot log files [15]. It can be seen in Fig. 7 that the real distribution 
of the QALD training datasets in almost all versions unfortunately does not match 
the ideal distribution. Changes to the knowledge base contribute to this, since 
they cause several questions to become unanswerable, so that these questions have 
to be removed from the dataset. 
Fig. 7. Ideal and real distribution of QALD training data in all versions 

Fig. 8. Distribution of unique questions in all QALD versions 


Our analysis discovered many exact duplicates, i.e., questions which 
were exactly the same in all attributes, in most QALD versions. We solved this 
problem by keeping the copy from the latest version, as it is more mature with respect 
to correctness and completeness of the question’s attributes. After this step, there 
were only 655 unique questions, as seen in Fig. 8. Subsequently, we removed 2 
semantically similar questions so that we finally have 653 questions in total. 
After applying the smart suggestions via 10 expert users, QUANT was able 
to produce a QALD-JSON formatted dataset of 558 total benchmark questions, 
increasing the size of QALD compared to QALD-8 by 110.6%. This dataset forms 
the QALD-9 dataset [10]. In particular, questions previously marked as 
out-of-scope in past challenges were curated such that they are now valid questions 
and are thus treated as novel questions in this new QALD dataset. 


6 Conclusion and Future Work 


QUANT’s evaluation highlights the need for better datasets and their 
maintenance. The degradation of datasets linked to the growing number of Linked 
Data-based knowledge bases creates a barrier to novel research methods which 
demand large amounts of high-quality training data. We were able to 
show that QUANT reduces the curation time by up to 91%. Furthermore, 
we saw that smart suggestions motivate users to engage in more attribute 
corrections than if there were no hints (compare Figs. 4 and 6). We also pointed 
out that we need to invest more time into SPARQL suggestions, as only 5 users 
accepted them. This low acceptance rate is due to the tremendous changes in 
the underlying ontologies from one version to the other. In future work, we plan to 
support more file formats based on our internal library.¹⁰ 
Abstract. In this paper we present the Simple-ML framework that we are 
developing to support efficient configuration, robustness and reusability of 
data analytics workflows through the adoption of semantic technologies. 
We present semantic data models that lay the foundation for the frame- 
work development and discuss the data analytics workflows based on 
these models. Furthermore, we present an example instantiation of the 
Simple-ML data models for a real-world use case in the mobility domain. 


1 Introduction 


The creation of a Data Analytics Workflow (DAW) demands significant data 
science expertise. This expertise is required to integrate data from heteroge- 
neous sources, to extract features for machine learning (ML) tasks, to configure 
the DAW and to optimize its parameters. The Simple-ML framework, which 
we are currently developing to address these challenges, aims to enable a robust, effi- 
cient and reusable DAW configuration through seamless integration of semantic 
information in all typical DAW components, making it a Semantic Data Analy- 
tics Workflow (SDAW). The adoption of semantic information, such as a domain 
model and semantic dataset profiles, substantially differentiates Simple-ML from 
existing data science frameworks such as RapidMiner or Microsoft Azure. 

In this paper we present Simple-ML and illustrate its application to data 
analytics for urban mobility. Popular problems in this domain include short-term 
road traffic forecasting [5], the prediction of congestion patterns [7] and impact 
prediction of planned special events [8]. The corresponding SDAWs require a 
variety of heterogeneous data sources, including but not limited to traffic and 
mobility data streams, map data (e.g. OpenStreetMap), knowledge graphs con- 
taining events and spatial entities (e.g. EventKG [3] and Wikidata), as well as 
traffic warnings, accidents, weather conditions and event calendars [5,8]. 

Our contributions are as follows: (i) We propose the Simple-ML framework 
for SDAWs: a semantic-driven approach that aims at increasing the efficiency of 
the workflow configuration, as well as robustness and reusability of DAWs using 
semantic technologies. (ii) We introduce a domain-specific semantic data model 
that provides semantic descriptions of the application domain and domain- 
specific relevant datasets (i.e. dataset profiles). (iii) We illustrate an application 
of the Simple-ML framework to a real-world use case in the mobility domain. 


[Figure 1 diagram: classes sml:DomainModel, sml:DomainClass, sml:Mapping, sml:Attribute, rdf:Property, dcat:Dataset and dcat:Catalog, with the mobility instantiation sml:MobilityModel and sml:MobilityClass; relations include sml:hasClass, sml:hasMapping and sml:hasAttribute] 


Fig. 1. A UML class diagram illustrating the Simple-ML domain model, the data 
catalog and a partial instantiation of the domain model in the mobility domain. 


2 Semantic Models for SDAWs 


The goals of Simple-ML are realized through a domain model (Fig. 1), semantic 
dataset profiles and the SDAW. We conduct the modeling in RDF¹, reusing 
existing vocabularies (e.g. dcat²) where possible. The terms specific to 
Simple-ML are defined in the Simple-ML vocabulary, denoted using the sml prefix³. 


Domain Model: In Simple-ML, the domain model describes relevant con- 
cepts, their properties and relations in the specific application domain. The class 
sml:DomainModel represents the model of an application domain. The domain- 
specific concepts are modeled as instances of the class sml:DomainClass. 


Dataset Profiles: A dataset profile is a formal representation of dataset cha- 
racteristics (features). A dataset profile feature is a dataset characteristic. Such 
features can belong to general, qualitative, provenance, statistical, licensing and 
dynamics categories [1]. In Simple-ML, the goal of the dataset profiles is to 
define dataset characteristics required to facilitate SDAWs, including informa- 
tion required for data materialization. 


Dataset profile: A dataset profile is modeled as an instance of dcat:Dataset. 
General dataset profile features as well as provenance and licensing features 
are described using the DCMI Vocabulary (dcterms). Statistical dataset profile 
features (e.g. the number of instances) can be provided at the dataset and the 
attribute levels. 


Dataset attributes: The attributes of the dcat:Dataset are modeled as instances 
of sml:Attribute. An attribute is described through its statistical characteristics 
at the instance level (e.g. the mean value sml:meanValue), along with the access 


1 Resource Description Framework (RDF): https://www.w3.org/RDF/. 

2 Data Catalog Vocabulary (DCAT): https://www.w3.org/TR/vocab-dcat/. 

3 The list of the adopted namespaces and the data catalog are available online: 
https://simple-ml.de/index.php/data-catalog/. 


Simple-ML: Towards a Framework for Semantic Data Analytics Workflows 361 


information to the underlying data source (e.g. the column name in a relational 
database) to facilitate data access and materialization. 


Dataset access: Simple-ML supports access to datasets through dedicated 
attributes that represent physical storage location and data format (e.g. 
sml:fileLocation and csvw:separator). Currently, relational databases 
(sml:Database) and text files (sml:TextFile) are supported. 


Mapping between the Dataset Profile and the Domain Model: Dataset 
attributes are mapped to the concepts in the domain model (sml:DomainClass) 
through the sml:Mapping class, as illustrated in Fig. 1. This mapping adds 
domain-specific semantic description to the dataset attributes and facilitates 
their use in the SDAWs. The class sml:Mapping provides two properties: 
sml:mapsToProperty to map a dataset attribute to a property in the domain model, 
and sml:mapsToDomain to specify the rdfs:domain of this property, which is an 
instance of sml:DomainClass. 
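
The mapping can be illustrated with a minimal sketch that lists the RDF triples for one dataset attribute as plain tuples. The namespace IRIs, the example resources (a speed attribute, a mobility property and class) and the direction of sml:hasMapping are assumptions made for illustration; only the class sml:Mapping and the properties sml:mapsToProperty and sml:mapsToDomain are taken from the text.

```python
# Hypothetical namespace IRIs: the paper does not spell them out here.
SML = "https://example.org/sml#"
EX = "https://example.org/mobility#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def mapping_triples(attribute, mapping, prop, domain_class):
    """Triples that link a dataset attribute to a domain-model property."""
    return [
        # Assumed direction: the attribute points to its mapping.
        (attribute, SML + "hasMapping", mapping),
        (mapping, RDF_TYPE, SML + "Mapping"),
        (mapping, SML + "mapsToProperty", prop),
        (mapping, SML + "mapsToDomain", domain_class),
    ]

triples = mapping_triples(
    EX + "speedColumn",     # illustrative dataset attribute
    EX + "speedMapping",    # illustrative mapping node
    EX + "averageSpeed",    # illustrative domain property
    EX + "TrafficFlow",     # illustrative domain class
)
```

Representing the mapping as its own node, rather than as a direct link from attribute to property, is what allows the rdfs:domain class to be stated alongside the property.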


Data Catalog: Dataset profiles are organized in a domain-specific data 
catalog. The extensible Simple-ML data catalog is modeled as an instance of 
dcat:Catalog. The data catalog schema including representations of dataset profiles 
and the mapping to the domain model is illustrated in Fig. 2. 
Fig. 2. The data catalog schema based on the dcat vocabulary. Arrows with an open
head denote the rdfs:subClassOf properties. Regular arrows denote the rdfs:domain
and rdfs:range restrictions. Blue boxes denote the key dcat and sml classes. (Color
figure online)


3 Semantic Data Analytics Workflow (SDAW) 


Figure 3 depicts an overview of a Semantic Data Analytics Workflow (SDAW). 
An SDAW consists of several steps, which are discussed in the following.

Fig. 3. An overview of the Simple-ML Semantic Data Analytics Workflow (SDAW).


Iterative Generation of a Semantic Data Specification: In this first step, 
the user defines the semantic specification of the data to be used in the workflow. 
The input in this step is the data catalog. The specification is defined through the 
selection of the operations to be applied to the dataset(s) in the data catalog 
and their attributes. Possible operations include dataset selection, sampling, 
feature selection, feature extraction and data integration. These operations can 
be applied iteratively in a user-defined order. The semantic data specification 
is defined at the metadata level using dataset profiles and does not require any 
physical data access. The specification can be stored to facilitate reusability. 
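
As an illustration of a specification that lives purely at the metadata level, the sketch below records operations as an ordered list and serializes it for reuse. This is a minimal sketch under assumptions: the class name, the operation names and the JSON format are hypothetical and are not taken from Simple-ML.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataSpecification:
    """Ordered list of operations over dataset profiles (metadata only)."""
    operations: List[Dict[str, Any]] = field(default_factory=list)

    def add(self, operation: str, **params: Any) -> "DataSpecification":
        # Operations are only recorded here; no physical data is touched.
        self.operations.append({"operation": operation, **params})
        return self

    def to_json(self) -> str:
        return json.dumps(self.operations)

    @classmethod
    def from_json(cls, text: str) -> "DataSpecification":
        return cls(operations=json.loads(text))


# Operations are applied iteratively, in a user-defined order.
spec = (
    DataSpecification()
    .add("dataset_selection", datasets=["F", "O"])
    .add("feature_selection", dataset="F", features=["sml:maxSpeed", "sml:hasTime"])
    .add("feature_extraction", dataset="F", source="sml:hasTime",
         derived=["week day", "hour of day"])
    .add("data_integration", left="F", right="O")
)

# Storing and reloading the specification supports reuse across workflows.
restored = DataSpecification.from_json(spec.to_json())
print([op["operation"] for op in restored.operations])
```

Because only operation descriptions are stored, the same specification could later be materialized against different physical copies of the datasets.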
Fig. 4. An example domain model for the mobility domain. The arrows with an
open head denote the rdfs:subClassOf properties. Regular arrows denote the
rdfs:domain and rdfs:range restrictions. Classes in green boxes are subclasses of
sml:MobilityClass.
Data Materialization: The data specification configured during the previous 
steps is applied to the physical datasets to materialize the integrated data. 


Semantic Machine Learning Workflow (SMLW): The domain model is
complemented with an ML domain model that captures the essential properties of
ML concepts and their implementation in specific frameworks. A domain-specific
language (DSL) for SDAWs and SMLWs will include an advanced type system
that will use metadata from the application domain to describe datasets and the
intermediate results of data processing on the one hand, and the metadata of the
ML domain to describe the ML processing steps on the other. This will enable
statically checking the correctness of applying particular ML methods to
particular data. To this end, we will build upon previous approaches aiming to
integrate ontologies into existing type systems (see e.g. [4]). We will go one
step further by designing a language dedicated to the data analytics and ML
domain that includes data models both for the data and for the ML processes.
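
The kind of static check described above can be illustrated with a small sketch that compares feature datatypes against the datatypes an ML method accepts, before any data is processed. Everything here is a hypothetical stand-in: the method table, the feature names and the function are assumptions for illustration and do not reflect the planned DSL or its type system.

```python
from typing import Dict, List, Set

# Hypothetical ML-domain metadata: datatypes each method accepts as features.
ML_METHODS: Dict[str, Set[str]] = {
    "linear_regression": {"xsd:integer", "xsd:double"},
    "decision_tree": {"xsd:integer", "xsd:double", "xsd:string"},
}


def check_method(method: str, feature_types: Dict[str, str]) -> List[str]:
    """Return error messages; an empty list means the method is applicable."""
    accepted = ML_METHODS[method]
    return [
        f"{feature} has type {datatype}, which {method} does not accept"
        for feature, datatype in feature_types.items()
        if datatype not in accepted
    ]


# Hypothetical application-domain metadata for two features.
features = {"speed": "xsd:integer", "street type": "xsd:string"}
print(check_method("linear_regression", features))
print(check_method("decision_tree", features))
```

The check uses metadata only, so an incompatible combination can be reported without touching the physical dataset.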


Result Visualization: The domain model can be used to automatically suggest 
suitable visualizations for specific data types. 


4 Domain Model for Mobility 


Figure 4 exemplifies an instantiation of the domain model in the mobility domain. 
This model includes the following classes: 


— sml:FloatingCarDataPoint: A vehicle’s type, position, time and speed. 
— sml:TrafficFlow: Vehicle count statistics (e.g. from road sensors [7]). 


sml:SimpleMLCatalog a dcat:Catalog ;
    dcat:dataset sml:FCDDataset .
sml:FCDDataset a dcat:Dataset ;
    dcterms:title "Floating Car Data" ; sml:hasFile sml:FCDDatasetFile ;
    dcterms:temporal [ so:startDate "2017-08-01"^^xsd:date ;
                       so:endDate "2017-12-31"^^xsd:date ] ;
    sml:hasAttribute sml:FCDDatasetAttribute1 .
sml:FCDDatasetFile a sml:TextFile ;
    dcterms:format "text/comma-separated-values" ; csvw:separator ";" .
sml:FCDDatasetAttribute1 a sml:Attribute ;
    rdfs:label "vehicle id"@en ; sml:columnNumber "0"^^xsd:integer ;
    sml:hasMapping [ sml:mapsToProperty sml:carId ;
                     sml:mapsToDomain sml:FloatingCarDataPoint ] .


Fig. 5. An excerpt of an example data catalog in the mobility domain. 


SELECT ?columnNumber ?attrName ?mapProperty ?mapDomain WHERE {
    sml:FCDDataset sml:hasAttribute ?attribute .
    ?attribute dcterms:identifier ?attrName .
    ?attribute sml:columnNumber ?columnNumber .
    OPTIONAL { ?attribute sml:hasMapping [
        sml:mapsToProperty ?mapProperty ; sml:mapsToDomain ?mapDomain ; ] . } }


Fig. 6. SPARQL query to select attributes of a given dataset (here: sml:FCDDataset).

— so:Event: Mobility-relevant events, their time and geographical location.

— sioc:Post: Social media posts modeled using the SIOC ontology⁴.

— sml:WeatherRecord: Temperature and rainfall at location and time. 

— dcterms:Location: Spatial information with geographical coordinates. 

— sml:SpeedLimit, sml:AccidentType, sml:VehicleType: Classes that represent
categorical values for speed limits, accident types and vehicle types.


These classes are subclasses of sml:MobilityClass, which is a subclass of
sml:DomainClass and thus allows the use of sml:Mapping as shown in Fig. 2.
Figure 5 provides an excerpt of an example Simple-ML mobility data catalog. 


5 Simple-ML Application to Traffic Speed Prediction 


We illustrate the iterative generation of a semantic data specification for the 
problem of traffic speed prediction for a specific road segment at a given time. 


Dataset Selection: The user selects the Floating Car Data (F) and
OpenStreetMap (O) datasets. Figure 6 shows the SPARQL query to retrieve F’s
profile.


Data Specification: (i) Feature Selection: The user selects four features
based on the domain model: sml:maxSpeed, sml:hasTime from (F) (class
sml:FloatingCarDataPoint), and rdf:type and sml:maxSpeed from (O) (class
sml:StreetSegment). (ii) Feature Extraction: The user selects the following temporal
features that are suggested by the system: week day, hour of day from (F). (iii)
Data Integration: A mapping between the vehicle positions in (F) and the street
segment coordinates in (O) is suggested by the system and chosen by the user.


Data Materialization: Using the data specification, relevant features are
materialized, with example instances shown in Table 1. The resulting data can then
be used in the SMLW to train a supervised traffic speed prediction model.
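
The temporal feature extraction in step (ii) can be sketched as follows. This is a minimal illustration, assuming timestamps are available as ISO 8601 strings; the function name and the example timestamp are hypothetical and not taken from Simple-ML.

```python
from datetime import datetime

# Fixed names avoid locale-dependent formatting.
WEEK_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def temporal_features(timestamp: str) -> dict:
    """Derive the 'week day' and 'hour of day' features from a timestamp."""
    moment = datetime.fromisoformat(timestamp)
    return {"week_day": WEEK_DAYS[moment.weekday()], "hour_of_day": moment.hour}


# Hypothetical floating car data point recorded on 6 August 2017 at 23:15.
print(temporal_features("2017-08-06T23:15:00"))
```

The two derived values correspond to the "Time (day)" and "Time (hour)" columns of the materialized instances.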


Table 1. Example instances generated using the semantic data specification


FloatingCarDataPoint (F)                        | StreetSegment (O)
Type | Speed | Time (day) | Time (hour) | Type          | maxSpeed
1    | 74    | Sunday     | 23          | motorway_link | 80
0    | 84    | Sunday     | 16          | motorway      | none
1    | 17    | February   | 8           | secondary     | 70


4 https://www.w3.org/Submission/sioc-spec/.
6 Related Work 


Recent works [2,4,6] aim to combine semantics and ML to address a variety of
real-world problems. Simple-ML goes one step further and makes use of semantics
in the entire DAW. Simple-ML employs dataset profiles and domain-specific
data models. The survey [1] provides a comprehensive overview of RDF dataset
profiling methods, tools, vocabularies and features partially utilized by
Simple-ML. We illustrate the use of Simple-ML in the mobility domain. Mobility has
seen many challenges and use cases for data analytics [5,7,8]. In Simple-ML, the
mobility domain is modeled in a light-weight, data-driven manner that facilitates
compatibility and reusability of the SDAWs across use cases and datasets.


7 Conclusion 


In this paper, we presented our current development towards the Simple-ML
framework. Simple-ML adopts semantic technologies to support the efficient
creation, configuration and reusability of robust data analytics workflows. We
illustrated an application of the framework to a real-world use case in the
mobility domain.
Abstract. RDF stores provide a simple abstraction for publishing and
querying data, which is becoming a norm in data sharing practice. They
also empower the decentralised architecture of data publishing for the
Web or IoT-driven systems. Such an architecture shares a lot in common
with blockchain infrastructure and technologies. Therefore, there is
emerging interest in marrying RDF stores and blockchain to realise the
desirable but speculative benefits of blockchain-powered data sharing.
This paper presents the first RDF store with blockchain that enables
lightweight edge devices to control the data sharing processes (for
personal and IoT data). Our novel approach, a deep integration with the
storage design of the RDF store, makes it possible to enforce control
measures on access methods and auditing policies over data elements via
smart contracts before they are fetched from the sources to the consumers.
Our experiments show that the prototype system delivers effective
performance for a processing load of 1 billion triples on a small network
of lightweight nodes which costs less than a commodity PC.


Keywords: Blockchain · Linked Data · RDF store


1 Introduction 


User-generated and sensor data is being widely used by various applications
to make them smarter and better. While such applications use the data
almost for free, such consumer data from both public and private sources is
incredibly valuable to corporations, marketers, investors, and individuals. For
example, American companies alone are estimated to have spent over $19 billion
in 2018 acquiring and analysing consumer data, according to the Interactive
Advertising Bureau¹. There are recurrent questions of how users can take
control of their data and benefit from it. It is obvious that in order to take control


1 https://www.iab.com/news/2018-state-of-data-report/.

of one’s data, there must be an ability to accurately account for ownership, and
similarly to keep a record of all transactions, exchanges and permissions
in a secure and tamper-proof manner.

The arrival of blockchain technologies promises to move your data
out of a corporation’s centralised database and onto your own devices or
encrypted storage of your choice at the edges of networks. This will be critical
in helping develop transparency and accountability in data sharing as we take
ownership of our data. Blockchain provides a number of substantive benefits. It
provides a layer of transparency and accountability for data ownership and
transactions. It can also minimise the influence of data middlemen in any consumer
data transaction while putting the data back in the hands of the consumer. This
means users can regain control of who uses their data, when their data is used
and how they are compensated for it.

In parallel, the recently emerging edge computing paradigm for IoT-driven
information systems makes it much more feasible to push computation
and data management operations closer to the data sources. Being able to store
and query data at the edges of networks offers opportunities to improve
performance and to reduce network overhead, and also provides flexibility for the
continuous integration of new IoT devices and data sources. This motivates us to
build a novel distributed RDF store which leverages blockchain benefits for data
publishers at the edges of networks. To the best of our knowledge, our system is
the first of its kind. The system design is presented in Sect. 2, followed by the
implementation report in Sect. 3. Our experimental results in Sect. 4 show
that a small cluster of Raspberry Pis (costing less than a commodity workstation)
can efficiently handle 1 billion RDF triples of IoT data.


2 System Overview 


To store and share RDF data on the edge while guaranteeing data ownership
and compensation for shared data, we combine a distributed database
system with blockchain. In decentralised edge networks, the distributed
database system allows any member to share their data and resources

Fig. 1. System overview and general procedure: (a) system overview; (b) general
procedure
while maintaining their autonomy and independence from centralised servers.
A feature of blockchain technology is the smart contract, which allows data
owners to control access to their data (e.g., who can have access or how
much to pay to gain access).

Our system (Fig. 1a) consists of two subsystems: a Distributed RDF Storage
(DRS) and a Smart Contract Manager (SCM). The DRS is responsible for
physically storing RDF data among the connected devices in the network. Meanwhile,
to secure the ownership of the data, the access to the data in the distributed
storage is encrypted into smart contracts which are published and managed by
the SCM. The basic principle of the system is that when a provider wants to
trade their data, they publish the data partitions (e.g., a set of sensor
readings) associated with smart contracts. A smart contract contains the meta
information of a published data partition, for example its index key, and a
contract that can be used to specify terms such as the price scheme and the access
control policy on data to be fetched by clients.

To answer a SPARQL query, the SPARQL query processor performs graph
pattern matching over RDF datasets. The graph matching
operator executes join operations between RDF triples that match the triple
query patterns. Therefore, to retrieve the data matching a query pattern, we
organise RDF data in a similar way to RDF4Led [4]. We store RDF triples in
three storage layouts as sorted permutations of triples: SPO (Subject -
Predicate - Object), POS and OSP. Each layout is a sorted list and is partitioned
into chunks called data blocks. The first triple of each data block and the
physical address of the data block form an index entry, which is kept in main
memory. The triples of the index entries are the keys for searching the data
blocks that potentially contain the triples matching a query pattern. However,
instead of storing the data blocks in the local file system as in RDF4Led, we
use a distributed file system to create the DRS subsystem. Distributed file
systems provide scalable data distribution and sharding features associated with
distributed data structures like DHTs [1]. In such a data structure, each stored
data block is mapped to a unique identifier. The identifier space is partitioned
among the nodes. Each node is responsible for storing all the data blocks that are
mapped to identifiers in its portion of the space. Hence, data is distributed and
resource consumption is balanced among the edge nodes. Furthermore, instead
of keeping the unique access identifiers of data blocks openly in index entries,
we allow publishers to encrypt them into smart contracts and keep them in the
Smart Contracts Storage (SCS) of the SCM (see Fig. 1b). In the Smart Contracts
Storage, the smart contracts are kept sorted by triple; therefore, the
smart contracts that hold the triples matching a query pattern can
be searched as described in [4]. Finally, a data block can be fetched only when
its access identifier is revealed, that is, only when the smart contract
that holds the identifier of the block is triggered.
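
The block-based index described above can be sketched in a few lines of Python. This is an illustrative, in-memory simplification and not the RDF4Led or DRS implementation: the function names, the tiny block size and the example triples are assumptions, and blocks are plain lists rather than entries in a distributed file system.

```python
from bisect import bisect_right


def build_layout(triples, order, block_size):
    """Sort triples by a permutation (e.g. SPO = (0, 1, 2)) and cut into blocks.

    Returns (index, blocks), where index[i] is the first key of blocks[i].
    """
    keys = sorted(tuple(t[i] for i in order) for t in triples)
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    index = [block[0] for block in blocks]
    return index, blocks


def candidate_blocks(index, prefix):
    """Positions of blocks that may hold keys starting with `prefix`."""
    n = len(prefix)
    # The block before the first index key greater than the prefix may still
    # contain matches at its end, so the scan starts there.
    start = max(bisect_right(index, prefix) - 1, 0)
    result = []
    for i in range(start, len(index)):
        if index[i][:n] > prefix:
            break
        result.append(i)
    return result


def match(index, blocks, prefix):
    """Fetch candidate blocks and keep only the keys matching the prefix."""
    n = len(prefix)
    return [key for i in candidate_blocks(index, prefix)
            for key in blocks[i] if key[:n] == prefix]


triples = [("s1", "p1", "o1"), ("s1", "p2", "o2"), ("s2", "p1", "o3"),
           ("s2", "p2", "o1"), ("s3", "p1", "o2")]
spo_index, spo_blocks = build_layout(triples, (0, 1, 2), block_size=2)
pos_index, pos_blocks = build_layout(triples, (1, 2, 0), block_size=2)

print(match(spo_index, spo_blocks, ("s2",)))  # pattern: s2 ?p ?o
print(match(pos_index, pos_blocks, ("p2",)))  # pattern: ?s p2 ?o
```

Only the small index has to stay in memory; a pattern with a bound subject is answered from the SPO layout, while a pattern with a bound predicate uses POS.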

The general workflow of the system is visualised in Fig. 1b. When a client
starts sharing their RDF dataset, the system indexes the data, partitions the
indexes into data chunks, stores the data chunks in the DRS and sends the
index entries to the SCM (1). The SCM encodes the arriving index entries into
smart contracts and stores these contracts in the Smart Contracts Storage.
Given the triple requests (2), the SCM searches the SCS for the contracts
that hold the accesses to the requested data. These contracts are then sent
to the Validation Service, which is developed with blockchain technology (3). The
Validation Service verifies whether the contracts are valid, and forwards
validation requests to the SPARQL Query Processor (4). When the contracts are
validated (5), the Validation Service triggers these contracts (6) and returns the
opened index entries with access identifiers to the SPARQL Query Processor (7).
With these index entries and access identifiers, the SPARQL Query Processor is
able to fetch the matched triples from the DRS and pass them to the join
operators (8) (9).
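
The gating role of a smart contract in this workflow can be illustrated with a plain Python simulation. This is a sketch under strong assumptions: it does not use Ethereum or any blockchain API, and the class, its fields and the pricing rule are hypothetical stand-ins for the terms a publisher might specify.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class AccessContract:
    """Simulated contract guarding the access identifier of one data block."""
    first_triple: Tuple[str, str, str]  # index key of the data block
    block_id: str                       # access identifier, hidden until triggered
    price: int                          # hypothetical price in arbitrary units
    allowed: Set[str] = field(default_factory=set)  # empty set: anyone may request

    def trigger(self, requester: str, payment: int) -> Optional[str]:
        """Reveal the block identifier only if the contract terms are met."""
        if self.allowed and requester not in self.allowed:
            return None
        if payment < self.price:
            return None
        return self.block_id


contract = AccessContract(("s1", "p1", "o1"), "block-42", price=5, allowed={"alice"})
print(contract.trigger("bob", 10))    # not on the access list
print(contract.trigger("alice", 1))   # payment too low
print(contract.trigger("alice", 5))   # terms met, identifier revealed
```

In the described system the revealed identifier would then be used by the SPARQL Query Processor to fetch the block from the DRS.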


3 Implementation and Deployment 


To implement the system, we extended the Java code base of RDF4Led [4], which
allows RDF data processing tasks (e.g., parsing, indexing) to execute on the edge
node. The DRS subsystem is implemented by re-engineering the Physical Layer
to store the data blocks (which contain RDF molecules) in the p2p file system
IPFS [1]. IPFS is a secure, high-throughput, distributed block storage model with
content-addressed hyperlinks. It allocates a unique hash for each stored block of
data. In IPFS, the data blocks are identified and retrieved by their IPFS hashes.
In the SCM, the SCS is stacked on top of the Buffer Layer. The smart contracts and
related features are implemented with Ethereum². However, instead of using the
in-memory cache of RDF4Led, the index entries, which also keep the addresses of
their corresponding smart contracts, are stored in Redis³. Redis is a distributed
in-memory key-value store that allows data to be clustered in the memory of
multiple devices. When a device joins the network, it also contributes its
computational resources to the whole system. To connect the Validation Service to
the Ethereum blockchain network, we use the Web3j library.

Figure 2a illustrates the physical data organisation of the SPO index layout in
our system. In the DRS layer, sorted lists of triples are partitioned, compressed
as molecules and stored in IPFS as byte arrays. The associated IPFS hashes of
the RDF molecules are packed with the addresses of the smart contracts. The smart
contracts are stored in the Ethereum blockchain and are programmed to return the
stored IPFS hashes if the transactions are triggered. In the SCS, each entry
contains the first triple of a molecule and its smart contract id. These entries
are kept in a Redis sorted list, which provides the key-range search that allows
index lookup in the same fashion as in RDF4Led.
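
The idea of storing compressed molecules under a content-derived identifier can be sketched as follows. This is a simplified, in-memory stand-in and not IPFS: a SHA-256 hex digest replaces the real IPFS hash format, the serialization is a naive space-separated text format (terms are assumed to contain no spaces or newlines, and blocks are assumed to be non-empty), and the class name is hypothetical.

```python
import hashlib
import zlib


class BlockStore:
    """In-memory stand-in for a content-addressed block store."""

    def __init__(self):
        self._blocks = {}

    def put(self, triples):
        """Compress a block of triples and store it under the hash of its bytes."""
        text = "\n".join(" ".join(triple) for triple in triples)
        payload = zlib.compress(text.encode("utf-8"))
        block_id = hashlib.sha256(payload).hexdigest()
        self._blocks[block_id] = payload
        return block_id

    def get(self, block_id):
        """Retrieve and decompress a block by its identifier."""
        text = zlib.decompress(self._blocks[block_id]).decode("utf-8")
        return [tuple(line.split(" ")) for line in text.split("\n")]


store = BlockStore()
molecule = [("s1", "p1", "o1"), ("s1", "p2", "o2")]
block_id = store.put(molecule)
print(block_id[:16], store.get(block_id))
```

Since the identifier is derived from the content, it reveals nothing about where the block is stored, which is why it can act as the access secret held by a smart contract.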

Figure 2b depicts our system’s deployment strategy. The nodes are installed
with IPFS and Redis, which make up our DRS and SCS layers respectively. Each
node’s IPFS and Redis instances join those of the other nodes to form an IPFS
cluster and a Redis cluster within the network, providing decentralised
distributed data storage. The Validation Service Cluster is implemented as a
private Ethereum blockchain using Geth, which in turn is configured to use the
Proof-of-Authority consensus mechanism. The remaining nodes are set up as
full-node clients, which provide additional storage to the service.


4 Evaluations 


This section presents our evaluation of the deployment presented in Fig. 2b.
We created a network of 15 Raspberry Pi 3 Model B devices (ARMv7 quad-core
1.2 GHz CPUs, 1 GB RAM, 64 GB SD card, Raspbian OS); each node costs
approximately €50. Note that the total cost of the whole setup is less than that
of a commodity PC. The RDF dataset for the evaluation is generated by mapping
sensor readings from the NOAA⁴ dataset to RDF using the SSN/SOSA ontology [3].
The schema of our dataset is provided along with the implementation in our
GitHub repository⁵.

We evaluated our system with two experiments. Experiment 1 consists of two
tests on two system settings. Firstly, we fixed the cluster size at 10 nodes and
measured the average throughput per node when inserting more data. Secondly,
we observed the increase in throughput when adding more nodes. Datasets of one
billion and 100 million triples were used in the first and second tests,
respectively. In Experiment 2, we measured the response time for searching the
triples matching single query patterns on the 1 billion triple dataset.

Figure 3a shows the result for the first setting, the accumulated throughput
of a fixed number of nodes. As predicted, the throughput gradually decreases
as the amount of data in the storage increases. After 500 million inserts, the
curve flattens. The accumulated throughput for a varying number of nodes in the
cluster is presented in Fig. 3b. Here, we plot throughput (in thousands of
triples/second) against the number of participating nodes. As the number of
processing nodes in the cluster increases, the

4 https://www.ncdc.noaa.gov/.
5 https://github.com/anhlt18vn/Semantic2019. 
Fig. 3. Insert throughput results 


throughput also increases. After reaching a peak, the growth levels off
due to the decentralised nature of the system.

From both tests, we can observe that the system is capable of handling
a large dataset thanks to its decentralised design. We have also seen that
the system’s performance stagnates after reaching its peak. This is mainly
because the decentralised system is subject to network latency, which arises
from its replication and distribution strategies. With the current architecture,
the system has proven highly scalable, although this degree of scalability comes
at a cost in performance.


Figure 4 shows the result of Experiment 2.
We can see that the system delivers very good
5 Conclusion and Outlook 


This paper presents the first implementation of a novel distributed RDF store
with blockchain technology for decentralised edge networks. The experiments
showed that the system can be deployed on a network of lightweight edge devices
such as Raspberry Pis. With a network of fifteen Raspberry Pi nodes, the
system is able to host a dataset of up to a billion RDF triples. The cost of
deploying such a system is quite competitive and flexible, as a Pi node costs
less than €50 and the number of nodes can be elastically increased or decreased
at runtime. As a next step, we will increase the number of nodes tenfold to study
the scalability and limitations. Regarding the shortcomings shown in Fig. 4, the
processing throughput, which causes long delays on queries that use a large
number of data blocks, can be increased by recent advances in transaction
throughput, e.g. 20k transactions/second in [2].
Abstract. We present an approach to verify off-chained information
using Linked Data, Smart Contracts, and RDF graph hashes stored on a
Distributed Ledger. We use the notion of a Linked Pedigree, i.e. a
decentralised dataset for storing hyperlinked information, as the modelling
foundation. We evaluate our approach by comparing different ways to build
the Smart Contract. We develop a cost model and show, based on our
implementation, that for managing multiple Linked Pedigree instances,
a single larger Smart Contract is superior to multiple smaller Smart
Contracts for supply chains shorter than 50 participants.


1 Introduction 


Chained value-creation networks are commonplace in many industries. Consider
e.g. supply chain networks in logistics or production systems, where goods and
services are handed over decentrally between different independent parties to
deliver goods and services to the customer. In such networks, transparency is
gaining importance. Customers demand verifiable¹ information on where their
food comes from (track & trace), and recall campaigns need to be organised
quickly and in a targeted way. Recently, distributed ledger-based solutions have
gained attention, e.g. TradeLens by IBM and Maersk for global trade networks².
But sharing information on a distributed ledger may not always be desirable: as
every participant in a distributed ledger stores a copy of the whole ledger, data
sovereignty and privacy become an issue. Moreover, storing data on the
distributed ledger is expensive, which calls for so-called “off-chaining” of data [1],
i.e. storing data outside of the distributed ledger while keeping the distributed
ledger in the loop by storing hashes on the ledger. To keep off-chaining simple,
a uniform access mechanism is desirable. Linked Data is a

1 To verify: “Make sure or demonstrate that (something) is true, accurate, or justified.”
(Oxford Living Dictionary). We assume truth of the information on the distributed
ledger to be established. Then, we allow for verification that the information has not
been changed.

2 “IBM teams with Maersk on new blockchain shipping solution”,
https://tcrn.ch/2vRLFLT.
light-weight standards-based way to publish data in a decentralised fashion, where
access control can be easily implemented. Hence, we ask: Can we combine the
verification capabilities of the distributed ledger with Linked Data management?

Transparently provided information is important, e.g. in the food sector,
where society demands more transparency regarding details on products and
their transportation³. More generally, in retail, transparency in the production
and transport of consumer goods and retail products is an important factor in
customer decisions⁴,⁵. Regulation authorities are discussing whether such product
transparency and documentation should be required in the future⁶. But it is not
enough for that information to be public. Customer trust needs to be ensured,
where structural assurances [2] such as the mathematical foundations of
distributed ledgers can serve as a basis. Publicly shared information has high
economic potential in the logistics domain, e.g. by addressing the bullwhip
effect, but is hindered by the need for privacy of businesses [5,7]. Hence, a more
cautious approach to sharing data, like disclosing data only to a selected number
of persons, may unlock some of the benefits. But even if organisations are willing
to share information, interoperability of the information systems is an issue
[5,7,12]. Hence, the flexible data model of RDF and the standardised light-weight
protocol HTTP can reduce friction. If RDF is not yet available in an organisation,
lifting of existing data to semantic models has been proposed for the supply chain
domain in [4].

Previous works at the intersection of Semantic Web and Distributed Ledger,
e.g. at the Linked Data and Distributed Ledgers workshop series (LD-DL)⁷,
have not considered off-chaining of data. Previous works on off-chaining of data
are often built using distributed hash tables [1], where the problem of data
sovereignty arises just as with storing data on the chain.

Our approach consists of the following parts (this unique combination and
parts 2, 4, 5, and 6 are the contributions of this paper):


1. We use Linked Data, i.e. RDF accessible using HTTP, to store data off-chain
in a decentralised fashion. Access control for data privacy can be layered on
top, e.g. using HTTP authentication, or more recent approaches such as Web
Access Control⁸ or WebID+TLS⁹.

2. We present a vocabulary that extends the Linked Pedigree ontology [10]
to describe a product’s handover history and the Ethereum Ontology¹⁰ to
describe an Ethereum distributed ledger.


3 https://www.forbes.com/sites/gmoanswers/2015/11/30/transparency-no-longer-optional/.

4 https://www.labelinsight.com/hubfs/Studies%20and%20Reports/2016-LI-Food-Revolution-Study.pdf.

5 https://www.pwc.de/de/handel-und-konsumguter/assets/bevoelkerungsbefragung-rueckverfolgbarkeit-als-kaufargument.pdf.

6 https://www.euractiv.com/section/agriculture-food/news/food-safety-midnight-deal-for-revised-general-food-law.

7 Browse with http://events.linkeddata.org/ldow-lddl/ as entry point.

8 https://www.w3.org/wiki/WebAccessControl.

9 https://www.w3.org/2005/Incubator/webid/spec/tls/.

10 http://ethon.consensys.net/.
3. We use the RDF graph hashing approach of [3] to connect the off-chained
data with the distributed ledger.

4. We present a link-traversal based querying approach for verifying data on a 
Linked Pedigree off-chain. 

5. We present a Smart Contract, i.e. code that can be executed on the Dis- 
tributed Ledger, for verifying data using the Distributed Ledger. 

6. We present a protocol to apply all of the above. 
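
The core idea of off-chaining, keeping the data elsewhere and only a hash on the ledger, can be illustrated with a short Python sketch. It makes simplifying assumptions that should be stated clearly: the hash below is a plain SHA-256 over sorted N-Triples lines, which is not the graph hashing approach of [3] and does not handle blank nodes; the ledger is simulated by a dictionary rather than a Smart Contract; all function names and example URIs are hypothetical.

```python
import hashlib

ledger = {}  # stand-in for hashes stored on a distributed ledger


def graph_hash(ntriples_lines):
    """Order-independent hash of a blank-node-free graph given as N-Triples lines."""
    lines = sorted({line.strip() for line in ntriples_lines if line.strip()})
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def register(uri, ntriples_lines):
    """Publisher side: record the hash of the off-chain graph on the ledger."""
    ledger[uri] = graph_hash(ntriples_lines)


def verify(uri, ntriples_lines):
    """Consumer side: check that the retrieved graph matches the ledger entry."""
    return ledger.get(uri) == graph_hash(ntriples_lines)


graph = [
    '<http://example.org/fish/1> <http://example.org/caughtAt> "North Sea" .',
    '<http://example.org/fish/1> <http://example.org/caughtOn> "2019-05-01" .',
]
register("http://example.org/pedigree/1", graph)

print(verify("http://example.org/pedigree/1", list(reversed(graph))))  # same graph
tampered = [graph[0].replace("North Sea", "Baltic Sea"), graph[1]]
print(verify("http://example.org/pedigree/1", tampered))               # changed graph
```

Sorting the statements before hashing makes the result independent of serialization order, which a graph hash needs because RDF graphs are sets of triples.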


The paper is structured as follows: First, we survey related work (Sect. 2).
Next, we present an example (Sect. 3), which also introduces the protocol.
Subsequently, we present the foundational definitions on which we base our
approach (Sect. 4). Then, we describe the components of our approach (Sect. 5),
that is, the vocabulary, the smart contract, and the graph traversal. We next
evaluate our approach (Sect. 6) by developing a cost model, which we instantiate
using an implementation. We then discuss our findings (Sect. 7). Last, we conclude
(Sect. 8).


2 Related Work 


At the intersection of supply chain and distributed ledger, there are two major
initiatives started in collaboration with IBM. Both initiatives are based on the
distributed ledger Hyperledger: TradeLens for global freight companies, and
FoodTrust for agricultural goods. Both approaches have similar characteristics:
All information (e.g. document filings, supply chain events, authority approval
status, ...) is stored on the distributed ledger. As all nodes that are part of the
distributed ledger have a full copy of the ledger, this hints at scalability issues.
Both solutions support access rights to this data on the ledger. TradeLens
cites data interoperability as a challenge. While they incrementally move to
UN’s CEFACT vocabulary¹¹, our approach allows for using semantic technologies
to achieve data interoperability using mappings between schemas. Similarly,
provenance.org, an online service for track and trace of goods using a distributed
ledger, stores all data on the ledger.

At the intersection of semantic technologies and distributed ledgers, different
ontologies have been proposed to describe a distributed ledger: There is,
e.g. GraphChain [11], BLONDiE¹², and EthOn¹³. Our approach uses parts of
EthOn. Besides defining an ontology, the GraphChain [11] approach also allows
to distribute RDF data onto a distributed ledger. Our approach however requires
data to be provided as Linked Data, irrespective of the back-end.

In the intersection of semantic technologies and supply chain, the Linked 
Pedigree approach has been developed [10]. Linked Pedigrees are RDF graphs 
to describe trails of ownership of goods provided via HTTP. Moreover, the 
paper contains a protocol for using the thus described data in a supply chain. 


11 https://blog.tradelens.comascomm/news/why-interoperability-matters/.
12 https://github.com/hedugaro/Blondie.
13 http://ethon.consensys.net/.
Our approach adds verification using distributed ledger technologies and hashing 
to Linked Pedigrees. 


3 Example 


We next describe an example to illustrate our approach. Imagine the following 
three steps in a simple supply chain: 


Item Creation: A fisherman creates an item, i.e. some fish.

Item Handover: The fish is handed over between supply chain partners, e.g.
from the fisherman to a trucker to a local supermarket to the consumer.

Item Verification: At the store, the consumer verifies information about the
fish as decision-making support for the purchase. Verification could also be
performed during each handover.


For the illustration, we look at the information transferred during these three 
steps: Within the first two steps, i.e. item creation and item handover, the item’s 
physical history is described and published as Linked Data. The third step of item 
verification solely corresponds to the verification of that published information, 
and does not involve checking on the physical item itself. For brevity of the 
example, we leave out verification during the handover steps. The overall protocol 
is depicted in Fig. 1. The top left group starting with “create item” in bold relates 
to the item creation. The next group starting with “transport item” relates to 
the handover. With “store item”, the step for verifying the data starts, ending 
in the actual purchase. 


3.1 Item Creation 


The fisherman creates a supply chain item by catching the fish. They record
information on the item and the catching process, e.g. fishing ground and time,
build an RDF graph from the information, and publish the graph via HTTP.
Thus, the initial part of the Linked Pedigree on the fish is formed. From this 
point, the creation procedure is the same as for any item handover in the supply 
chain. 


3.2 Item Handover 


When the fish is handed over, e.g. from the fisherman to the trucker that carries 
the fish to the market, an RDF graph with information on the hand-over is 
created and stored in the Linked Data store of choice of the party that owns the 
fish before the hand-over. The information is linked to the RDF graph describing 
the previous Linked Pedigree part, which contains an event that concerns this 
fish. Thus, we form a hyperlinked graph of the fish’s product trail. Additional 
information may be included ad libitum in each step, e.g. information on the 
item’s creation. For later verification purposes, a hash of the information is put 
into the Distributed Ledger using a Smart Contract. 


3.3 Item Verification


Before actually buying the fish, the consumer may want to ascertain that the
fish’s information has not been (maliciously) tampered with, e.g. by a
retrospective adjustment of the cooling information. To this end, the consumer
looks up the fish’s information, which the supermarket provides in the form of a 
URI of a Linked Pedigree part. This Linked Pedigree part contains a reference 
to the previous part, which the end-consumer now dereferences. Consulting a 
Smart Contract, the customer can determine whether the retrieved information 
has not been changed since it was first published. By following the links in a 
Linked Pedigree to the respective previous Linked Pedigree and by dereferencing 
the corresponding identifiers, the customer can go back in the information trail 
on the fish right until the very beginning, i.e. the catchment. In each step, the 
customer can consult the Smart Contract to verify the integrity of the informa- 
tion provided. 

This verification can be performed analogously at any point in the supply 
chain by any participant, starting at different points in the traversal. 


4 Preliminaries 


We base our approach on Linked Data, i.e. we make use of URIs and provide
hyperlinked RDF graphs via HTTP. We build Linked Pedigrees in the form of
RDF graphs. We store RDF graph hashes in a Distributed Ledger based on 
Ethereum using a Smart Contract. 


4.1 Linked Data, URIs, RDF, and HTTP 


We follow the Linked Data principles¹⁴: We use Uniform Resource
Identifiers¹⁵ (URIs) as names for things. We use graphs expressed according to
the Resource Description Framework¹⁶ (RDF) to describe things. An RDF graph
is defined as a set of triples. With $\mathcal{U}$ as the set of all URIs,
$\mathcal{B}$ as the set of all blank nodes, and $\mathcal{L}$ as the set of
all literals, a triple $t$ can be defined as
$t \in (\mathcal{U} \cup \mathcal{B}) \times \mathcal{U} \times (\mathcal{U} \cup \mathcal{B} \cup \mathcal{L})$.
In our examples, we use CURIEs¹⁷ that allow abbreviating URIs using
prefixes¹⁸. We use the Hypertext Transfer Protocol¹⁹ (HTTP) to dereference
URIs and assume dereferenceable URIs.


14 https://www.w3.org/DesignIssues/LinkedData.
15 https://www.ietf.org/rfc/rfc3986.txt.
16 https://www.w3.org/TR/rdf11-concepts/.
17 https://www.w3.org/TR/curie/.
18 We point to prefix.cc for resolving the prefixes. Moreover, we use p: as short for
http://purl.org/pedigree, e: as short for http://ethon.consensys.net/, and x: as
short for http://people.aifb.kit.edu/co1683/2019/ld-chain/vocab for our extensions.
19 https://www.ietf.org/rfc/rfc7230.txt.


4.2 Linked Pedigree


A Linked Pedigree [10] is a trail of ownership of a product published as Linked 
Data described using terms from the OntoPedigree ontology. Each Linked Pedi- 
gree consists of different parts, i.e. instances of the class p:Pedigree, which 
reflect the different owners. The parts are assumed to be linked using the 
p:hasReceivedPedigree property. As each owner of a product may choose a 
storage provider of their liking, the Linked Pedigree can be regarded as a
decentralised dataset. Each part of a Linked Pedigree bears a status, p:Initial,
p:Intermediate, or p:Final. We show the terms of the OntoPedigree ontology
that we use in this paper as part of Fig. 2. 
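
As an illustration, two linked parts of a Linked Pedigree could be written down as sets of triples. The following is a minimal sketch that uses plain Python tuples instead of an RDF library. The example.org URIs and the helper `previous_part` are hypothetical; only p:Pedigree, p:hasReceivedPedigree, p:hasPedigreeStatus, and the status values are taken from the text.

```python
# Triples are (subject, predicate, object) tuples; terms are URIs or CURIEs.
RDF_TYPE = "rdf:type"

# Hypothetical URIs under which the fisherman and the trucker publish their parts.
FISHER_PART = "http://fisher.example.org/pedigree/1"
TRUCKER_PART = "http://trucker.example.org/pedigree/1"

fisher_graph = {
    (FISHER_PART, RDF_TYPE, "p:Pedigree"),
    (FISHER_PART, "p:hasPedigreeStatus", "p:Initial"),
}

trucker_graph = {
    (TRUCKER_PART, RDF_TYPE, "p:Pedigree"),
    (TRUCKER_PART, "p:hasPedigreeStatus", "p:Intermediate"),
    # Link back to the part published by the previous owner.
    (TRUCKER_PART, "p:hasReceivedPedigree", FISHER_PART),
}


def previous_part(graph, part):
    """Return the URI of the previous part, or None for an initial part."""
    for s, p, o in graph:
        if s == part and p == "p:hasReceivedPedigree":
            return o
    return None


print(previous_part(trucker_graph, TRUCKER_PART))
print(previous_part(fisher_graph, FISHER_PART))
```

Because each part only names its predecessor by URI, the two graphs can live on different servers and still form one trail.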


4.3 Hashing RDF Graphs 


To hash RDF graphs, we apply the approach of Hogan [3]. The approach allows 
for determining stable hashes of RDF graphs in the presence of isomorphism- 
preserving transformations of the graph, i.e. triple re-ordering and blank node 
renaming. 
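
To illustrate what a stable hash has to achieve, consider the following minimal sketch. It is not the algorithm of [3]: it only covers graphs without blank nodes, for which sorting the serialised triples already yields a canonical form, whereas stable blank node labelling is the hard part that [3] addresses. The function name `ground_graph_hash` and the example URIs are assumptions made for illustration.

```python
import hashlib


def ground_graph_hash(triples):
    """SHA-256 over the sorted triples of a graph WITHOUT blank nodes.

    Simplification: blank node renaming is not handled here.
    """
    lines = sorted(" ".join(t) + " ." for t in set(triples))
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


fish = "<http://example.org/fish>"
g1 = [
    (fish, "<http://example.org/caughtAt>", '"North Sea"'),
    (fish, "<http://example.org/cooledAt>", '"2"'),
]
g2 = list(reversed(g1))  # same graph, different triple order
tampered = [g1[0], (fish, "<http://example.org/cooledAt>", '"8"')]

print(ground_graph_hash(g1) == ground_graph_hash(g2))        # True
print(ground_graph_hash(g1) == ground_graph_hash(tampered))  # False
```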


4.4 Distributed Ledger Technologies 


Distributed Ledger Technologies is the umbrella term for distributed ledger con- 
cepts like blockchain or transaction-based directed acyclic graphs [6]. A dis- 
tributed ledger is a distributed database in a decentralised network, where 
changes to the database, i.e. transactions, have to be approved by network nodes 
via a consensus algorithm [8]. This allows for secure processing of transactions 
between parties that do not trust each other. Furthermore, when new data is 
appended to the distributed ledger, timestamps and hash-based references to pre- 
vious data are included. This meta data leads to a high degree of data integrity 
and imposes a high effort on retrospective modification of data [8]. In addition, 
as the database is replicated in full, every network participant can query their 
instance of the distributed ledger. Hence, all data and all associated changes are 
transparent to the entire network. 


Ethereum Blockchain. For our work, we choose Ethereum, a well-established
blockchain implementation that allows for the deployment of decentralised
applications via Smart Contracts²⁰. Ethereum allows for building private
proof-of-work blockchains. Proof-of-work is a consensus algorithm based on expensive
compute operations, which need to be executed for the approval of blocks of 
transactions. Participating in the consensus creation, i.e. approving blocks of 
transactions following a specified algorithm, here proof-of-work, is also referred 
to as “mining”. 


20 https://github.com/ethereum/wiki/wiki/White-Paper.
Closely connected to the mining process in an Ethereum network is
Ethereum’s internal cryptocurrency called “Ether”. Ether is used to pay
transaction fees. Whenever a transaction is issued, the miner who approves the
transaction is to be compensated for lending their computing power to the
network. This network utilisation is measured in “gas”, Ether’s internal utility
value. Therefore, costs are typically given in gas. However, in private
blockchain networks the amount of computing power necessary for proof-of-work
based consensus can be set to a reasonably low level, such that transaction fees
as well as the energy cost for computing the proof-of-work algorithm are kept
within limits.


Ethereum Smart Contracts. Ethereum also allows for the deployment of 
Smart Contracts. Smart Contracts allow for defining application logic that can 
be executed directly on the distributed ledger. Applications built as Smart Con- 
tracts are thus sometimes called “decentralised applications”. A Smart Contract 
can be regarded as application logic that executes automatically during mining 
when the conditions of the contract are met [13]. From a programming
perspective, a Smart Contract is a piece of code that is stored on a distributed
ledger and executed in a decentralised manner, i.e. the code is executed locally
and the resulting database change, if any, is then synchronised and agreed upon
with the network.


5 Technical View on Key Components 


In the following, we will present our approach from a technical perspective. 
We first elaborate on the model of a Linked Pedigree and its ontology to model an
item’s creation and handovers among supply chain partners. Then, we outline the
functionality of the implemented Smart Contract that enables the item
verification process. Finally, we present our Linked Graph Traversal algorithm,
thereby explaining the procedure of item verification and Linked Pedigree
retrieval in detail.


5.1 Vocabulary 


In each Linked Pedigree part that is not an initial Linked Pedigree part, the prop- 
erty p:hasReceivedPedigree specifies the respective previous Linked Pedigree 
part by its URI. When additional information is desired to be verifiable as well, 
additional triples can be added ad libitum. For verifying the information on the 
Linked Pedigree using the Distributed Ledger, we have to add information on 
where to verify the information. To this end, we built an ontology by taking 
selected parts from the OntoPedigree ontology, added terms from the EthOn 
ontology, and invented new terms. A depiction of our overall data modelling can 
be found in Fig. 2. 


5.2 Smart Contract 


Our Smart Contract offers three functions: First, RDF graph hashes of Linked
Pedigree parts can be stored together with their URI on the distributed ledger.
Further, these hashes can be looked up from the distributed ledger using their
associated URI. Finally, the URI of a single Linked Pedigree part can be looked
up using its direct successor’s URI.


[Figure 2: UML class diagram. p:Pedigree has the properties
p:hasReceivedPedigree (p:Pedigree), p:hasProductinfo (rdfs:Resource),
p:hasPedigreeStatus (p:PedigreeStatus, an enumeration of p:Initial,
p:Intermediate, p:Final), dc:publisher (dc:Agent), p:pedigreeCreationTime
(xsd:dateTime), and x:hasValidationInfo (x:ChainInfo). x:ChainInfo has the
properties x:hasContractAccount (e:ContractAccount) and x:hasBootNode
(x:BootNode). Further terms shown: e:partOf, e:minesFor, x:hasEnodeURI,
e:containsBlock, e:Network, e:Blockchain, e:FullNode, e:GenesisBlock,
xsd:hexBinary, xsd:anyURI, xsd:integer.]

Fig. 2. The vocabulary we use for our approach. We use a UML class diagram to
illustrate modelling in RDFS using the following correspondence: UML’s class,
association, and inheritance map to rdfs:Class, rdfs:domain and rdfs:range of an
rdf:Property, and rdfs:subClassOf relationships, respectively.


Storing Hashes. An agent requesting the Smart Contract to store a hash of
a Linked Pedigree part must provide the following arguments:


— The hash itself

— The URI of the Linked Pedigree part (required to enable look-ups of the
hash by its Linked Pedigree part URI)

— The URI of the previous Linked Pedigree part (needed in order to append
the current part’s URI to the correct Linked Pedigree)

— The wallet of the next owner (required for rights management; specifically,
this lets us restrict writing information on this Linked Pedigree to the next
owner)


In Fig. 1, the calls that “issue [a] transaction” are storing hashes. Once stored, 
the Smart Contract does not allow for hashes to be altered or removed. 

A request for storage of a hash results in a transaction on the distributed 
ledger issued by the Smart Contract. Therefore, the requesting agent has to pay 
a transaction fee in order to compensate for the required network utilisation. 


Retrieving Hashes. To enable verification by hash comparison, the hash stored
for the RDF graph of a Linked Pedigree part can be retrieved by calling
the Smart Contract with the part’s URI. Such look-ups are labelled
“read call” in Fig. 1. Since this look-up can be carried out without a
transaction to the network, no transaction fee applies here.


Retrieving URIs. An agent may not be able or authorised to dereference the
URI of a Linked Pedigree part. For this case, the Smart Contract offers a
fall-back function for looking up the corresponding previous part’s URI. This
way, unavailable Linked Pedigree parts can be skipped, thereby keeping the
traversable chain of URI references intact. Since any agent ought to be able to
look up URIs, retrieval of URIs via the Smart Contract is unrestricted. We
omitted such calls to the Smart Contract from Fig. 1. As with retrieving a hash,
URI retrieval does not require a transaction, so again no transaction fee
applies.
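
The three functions can be mimicked off-chain with a small in-memory model. The following is a sketch in Python, not the deployed Smart Contract code. The class and method names, the use of plain strings as wallets, and the rule that an initial part (one without a previous part) may be registered by anyone are assumptions made for illustration.

```python
class PedigreeContractModel:
    """Off-chain model of the described functions (not actual contract code)."""

    def __init__(self):
        self._hashes = {}      # part URI -> hash of its RDF graph
        self._previous = {}    # part URI -> URI of the previous part (or None)
        self._next_owner = {}  # part URI -> wallet allowed to append to it

    def store(self, sender, graph_hash, uri, previous_uri, next_owner):
        """Models the fee-incurring transaction that stores a hash."""
        if uri in self._hashes:
            raise ValueError("stored hashes cannot be altered or removed")
        if previous_uri is not None:
            if previous_uri not in self._hashes:
                raise ValueError("unknown previous part")
            if self._next_owner[previous_uri] != sender:
                raise PermissionError("only the next owner may append")
        self._hashes[uri] = graph_hash
        self._previous[uri] = previous_uri
        self._next_owner[uri] = next_owner

    def get_hash(self, uri):
        """Read call: look up the stored hash by part URI."""
        return self._hashes[uri]

    def get_previous_uri(self, uri):
        """Read call: fall-back look-up of the previous part's URI."""
        return self._previous[uri]


FISHER_PART = "http://fisher.example.org/p1"    # hypothetical URIs
TRUCKER_PART = "http://trucker.example.org/p1"

contract = PedigreeContractModel()
contract.store("fisher", "hash-a", FISHER_PART, None, "trucker")
contract.store("trucker", "hash-b", TRUCKER_PART, FISHER_PART, "market")
print(contract.get_hash(FISHER_PART))
print(contract.get_previous_uri(TRUCKER_PART))
```

In this model, only `store` changes state, which mirrors the distinction between fee-incurring transactions and free read calls.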


5.3 Link Traversal and Data Verification 


To retrieve and verify a specific Linked Pedigree, an agent starts with the URI
of the Linked Pedigree part they know to be the last in the chain. They can then
obtain the RDF graph that describes this Linked Pedigree part using an HTTP
GET request. From this RDF graph, the agent calculates a hash value using the
blabel approach from [3]. We use the implementation available online²¹. At the
same time, the agent retrieves the stored hash for this URI from the Distributed
Ledger using the Smart Contract. The agent can then verify the information by
comparing the hash they generate to the hash retrieved from the Smart Contract.

To go further in the history of the item, the agent performs Link
Traversal-based querying intertwined with verifying as just described: For a
part p, the agent queries the RDF graph about p for triples with p as subject and
p:hasReceivedPedigree as predicate. Then, the agent finds the URI of the
previous Linked Pedigree part in object position. With this URI, the agent
performs dereferencing, verifying, and querying as described, until the initial
Linked Pedigree part, i.e. the part with p:Initial status, is reached.

The traversal algorithm may terminate exceptionally, e.g. when Linked Pedi- 
gree parts are unavailable due to outages, or insufficient rights for the agent. 
However, for each Linked Pedigree part URI the previous Linked Pedigree part’s 
URI can be looked up using the Smart Contract. This allows for skipping of 
unavailable parts. 

By traversing backwards on this chain of URIs, the item’s whole Linked
Pedigree is retrieved, see the HTTP GET requests from the End Consumer to
Trucker and Fisher in Fig. 1. If all hash pairs match, the whole Linked Pedigree
can be regarded as verified. Additionally provided links can be looked up for 
more information, e.g. on the item itself, its production or its transportation, if 
corresponding access rights are granted. 
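
The traversal and verification procedure can be sketched as follows. This is an illustration under several assumptions: the hash function is a simple stand-in for blabel that ignores blank nodes, HTTP dereferencing and the Smart Contract read calls are replaced by in-memory look-ups, and the sketch stops when a part has no p:hasReceivedPedigree triple rather than checking for the p:Initial status. All function names and URIs are hypothetical.

```python
import hashlib


def graph_hash(triples):
    """Stand-in for the blabel hash: sorted triples, no blank node handling."""
    lines = sorted(" ".join(t) for t in triples)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def verify_pedigree(last_uri, fetch, stored_hash, stored_previous):
    """Walk back from last_uri and return a list of (uri, status) pairs.

    fetch(uri) returns a set of triples or raises LookupError if unavailable.
    stored_hash and stored_previous stand for the Smart Contract read calls.
    """
    report, uri = [], last_uri
    while uri is not None:
        try:
            graph = fetch(uri)
        except LookupError:
            # Part unavailable: skip it using the URI kept by the contract.
            report.append((uri, "skipped"))
            uri = stored_previous(uri)
            continue
        ok = graph_hash(graph) == stored_hash(uri)
        report.append((uri, "verified" if ok else "MISMATCH"))
        # Follow p:hasReceivedPedigree; an initial part has no such triple.
        uri = next((o for s, p, o in graph
                    if s == uri and p == "p:hasReceivedPedigree"), None)
    return report


# Hypothetical data: three parts, of which the trucker's part is unavailable.
F = "http://fisher.example.org/p1"
T = "http://trucker.example.org/p1"
M = "http://market.example.org/p1"
web = {
    F: {(F, "p:hasPedigreeStatus", "p:Initial")},
    M: {(M, "p:hasPedigreeStatus", "p:Final"), (M, "p:hasReceivedPedigree", T)},
}
hashes = {uri: graph_hash(g) for uri, g in web.items()}
previous = {M: T, T: F, F: None}


def fetch(uri):
    if uri not in web:
        raise LookupError(uri)
    return web[uri]


for entry in verify_pedigree(M, fetch, hashes.get, previous.get):
    print(entry)
```

With this data, the market's and the fisherman's parts are verified and the trucker's part is skipped via the fall-back look-up. A production version would additionally guard against cycles in the chain of URIs.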


6 Evaluation 


To evaluate our approach, we focus on finding the best way to implement our 
concept within the chosen environment of a private Ethereum proof-of-work 


21 https://blabel.github.io/.
network. We do not compare our approach in technical terms, e.g. transactions
possible per second, to an existing DLT-based solution, since our approach only 
uses a standard Ethereum implementation as infrastructure. When choosing a 
different infrastructural environment, e.g. a permissioned blockchain or a proof- 
of-authority consensus, other evaluation criteria may apply. 

We contrast two ways of achieving the presented functionality using Smart
Contracts. Looking at our supply chain example, we ask whether deploying one
Smart Contract for the whole supply chain network or a more fine-grained
approach with multiple Smart Contracts is more beneficial. For clarity of the
presentation, we call a Smart Contract that manages hashes and URIs for multiple
items in the network a “Multi-Item-Contract” (MIC). The alternative, a
“Single-Item-Contract” (SIC), is a Smart Contract that is used for validating
one item exclusively.

For the evaluation, we first build a cost model to compare the two approaches
regarding operating cost and storage overhead. We then instantiate the cost
model experimentally.


6.1 Cost Model 


Let $\mathcal{C}$ denote a set of Smart Contracts that is deployed on the
blockchain. Let $\mathcal{I} \subset \mathcal{U}$ denote the set of URIs that
identify single item instances, for each of which a Linked Pedigree exists. Let
$\mathcal{P} \subset \mathcal{U}$ denote the set of URIs that identify single
Linked Pedigree parts.

We define a function $g : \mathcal{P} \to \mathcal{P}$ that maps a Linked
Pedigree part $p_k \in \mathcal{P}$ to another part $p_j \in \mathcal{P}$, where
$k \neq j$. Thus, function $g$ appends the Linked Pedigree part $p_k$ to a
previous Linked Pedigree part $p_j$. This results in chains of Linked Pedigree
parts that we formally describe as $n$-tuples. A single chain, i.e. $n$-tuple,
forms an item’s Linked Pedigree. Let $A \subset \mathcal{P}^n$ be the set of all
Linked Pedigrees. An item $i$’s Linked Pedigree $A_i \in A$ is then an $n$-tuple
of the form:

$$A_i = (p_0, p_1, \ldots, p_{n-1}) \in \mathcal{P}^n$$

Each Linked Pedigree has an initial element $p_0 \in \mathcal{P}$, where

$$(\exists p_j \in \mathcal{P} : g(p_j) = p_0) \land (\neg\exists p_k \in \mathcal{P} : g(p_0) = p_k)$$

and consists further of $n-1$ elements $p_k \in \mathcal{P}$, where

$$g(p_k) = p_{k-1}, \quad \forall k \in \{1, 2, \ldots, n-1\}$$

Further, we define $h$ as the bijective mapping between an item’s URI and its
Linked Pedigree, $h : A \to \mathcal{I}$. Last, we define the function
$e : \mathcal{I} \to \mathcal{C}$, which maps an item $i \in \mathcal{I}$ to a
Smart Contract $c_j \in \mathcal{C}$, since each item is validated by a
Smart Contract.

We thus defined three dimensions of our approach, which we use in the
evaluation: the set of deployed Smart Contracts $\mathcal{C}$, the set of items
$\mathcal{I}$ (Linked Pedigree equivalent), and the $n$-tuples of Linked
Pedigree parts (each forming a Linked Pedigree):

$$\{\mathcal{C}, \mathcal{I}, \mathcal{P}^n\}.$$
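
As a small illustration of how the function g induces a chain, the following sketch rebuilds an item's Linked Pedigree tuple from a mapping of each part to its predecessor. The dictionary representation, the function name `pedigree_chain`, and the identifiers are hypothetical stand-ins for URIs.

```python
def pedigree_chain(last_part, g):
    """Follow g backwards from the last part; return the parts, initial first."""
    chain = [last_part]
    while chain[-1] in g:  # the initial part has no image under g
        chain.append(g[chain[-1]])
    return tuple(reversed(chain))


# g maps every non-initial part to its predecessor.
g = {"p2": "p1", "p1": "p0"}
# e maps items to the Smart Contract that validates them.
e = {"item-1": "contract-A"}

chain = pedigree_chain("p2", g)
print(chain)       # ('p0', 'p1', 'p2')
print(len(chain))  # 3
```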


6.2 Applying the Cost Model


Applying the model $\{\mathcal{C}, \mathcal{I}, \mathcal{P}^n\}$ to the question
at hand, whether a MIC or a SIC approach is preferable, we make the following
assumptions:

The number of deployed Smart Contracts $|\mathcal{C}|$ is variable. Further, the
number of supply chain items is growing over time due to ongoing business:

$$\lim_{t \to \infty} |\mathcal{I}| = \infty.$$

The same holds for the total number of Linked Pedigree parts, since

$$|\mathcal{P}^n| = \dim(\mathcal{P}^n) \times |\mathcal{I}|,$$

each item having a Linked Pedigree with $n$ elements. For our evaluation, we
assume a constant average size of a Linked Pedigree, $\dim(\mathcal{P}^n) = n$.

In the following, we compare two approaches regarding operating cost and
storage overhead: The first one is a MIC approach with a constant
$|\mathcal{C}| = 1$. The alternative is a SIC approach with
$|\mathcal{C}| = |\mathcal{I}|$ growing over time.


Operating Cost. When deploying a Smart Contract or issuing a transaction,
the computing power lent by the network’s miners needs to be compensated by
a transaction fee. Therefore, the usage of Smart Contracts is associated with
operating cost.

To compare the operating cost of a MIC approach and a SIC approach, let
$d_a$ denote the average deployment cost for approach $a$. Let $r_a$ denote an
item’s average registration cost for approach $a$, which is simply the cost of
storing the initial Linked Pedigree part. Let $s_a$ denote the average cost of
storing a hash of an intermediate Linked Pedigree part for approach $a$. Then
for an approach $a$, the cost function

$$p_a(\mathcal{C}, \mathcal{I}) = d_a \times |\mathcal{C}| + (r_a + s_a \times (n-1)) \times |\mathcal{I}|, \quad a \in \{\mathrm{MIC}, \mathrm{SIC}\}$$

applies. For the MIC approach, we have $|\mathcal{C}| = 1$, resulting in the MIC
cost function

$$p_{\mathrm{MIC}}(\mathcal{I}) = d_{\mathrm{MIC}} + (r_{\mathrm{MIC}} + s_{\mathrm{MIC}} \times (n-1)) \times |\mathcal{I}|.$$

For the SIC approach, we have $|\mathcal{C}| = |\mathcal{I}|$ since there is a
deployment of a Smart Contract per item. This results in the SIC cost function

$$p_{\mathrm{SIC}}(\mathcal{I}) = (d_{\mathrm{SIC}} + r_{\mathrm{SIC}} + s_{\mathrm{SIC}} \times (n-1)) \times |\mathcal{I}|.$$


By comparing the two cost functions, we can see that a growing |Z| leads to 
higher operating cost for a SIC approach due to deployment cost typically being 
far greater than function execution cost. So, an increasing number of supply 
chain items |Z| favours a MIC approach. 

Further, equating both cost functions,
$p_{\mathrm{MIC}}(\mathcal{I}) = p_{\mathrm{SIC}}(\mathcal{I})$, and solving for
the number of supply chain items leads to

$$|\mathcal{I}|^*(n) = \frac{d_{\mathrm{MIC}}}{d_{\mathrm{SIC}} + (r_{\mathrm{SIC}} - r_{\mathrm{MIC}}) + (n-1)(s_{\mathrm{SIC}} - s_{\mathrm{MIC}})}$$

with $|\mathcal{I}|^*(n)$ being the number of supply chain items at which both
approaches are at equal operating cost, dependent on the number of parts in a
Linked Pedigree. $|\mathcal{I}|^*(n)$ shows that the more parts form a Linked
Pedigree instance, i.e. the greater $n$, the less desirable is a MIC deployment
due to the slightly lower storing cost of a SIC.

For our implementation, the following estimated²² gas costs apply:

$$d_{\mathrm{MIC}} = 1{,}065{,}000; \quad r_{\mathrm{MIC}} = 95{,}000; \quad s_{\mathrm{MIC}} = 185{,}000$$

$$d_{\mathrm{SIC}} = 750{,}000; \quad r_{\mathrm{SIC}} = 80{,}000; \quad s_{\mathrm{SIC}} = 170{,}000$$

These estimated gas costs lead to

$$|\mathcal{I}|^*(n) = \frac{1{,}065{,}000}{750{,}000 - 15{,}000 \times n}$$

Here, $n = 50$ is the vertical asymptote of $|\mathcal{I}|^*(n)$, meaning that
for a Linked Pedigree length of less than 50 parts per single Linked Pedigree a
MIC-based approach outperforms a SIC-based one as soon as the number of items
exceeds $|\mathcal{I}|^*(n)$.
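
The break-even computation can be reproduced with a few lines of Python. The sketch below plugs the estimated gas costs from the text into the two cost functions; the function names and the dictionary layout are assumptions made for illustration.

```python
# Estimated gas costs from the text: deployment d, registration r, storing s.
GAS = {
    "MIC": {"d": 1_065_000, "r": 95_000, "s": 185_000},
    "SIC": {"d": 750_000, "r": 80_000, "s": 170_000},
}


def operating_cost(approach, items, n):
    """Gas needed for `items` Linked Pedigrees of n parts each."""
    c = GAS[approach]
    contracts = 1 if approach == "MIC" else items
    return c["d"] * contracts + (c["r"] + c["s"] * (n - 1)) * items


def break_even_items(n):
    """Number of items at which both approaches cost the same, if any."""
    mic, sic = GAS["MIC"], GAS["SIC"]
    denom = sic["d"] + (sic["r"] - mic["r"]) + (n - 1) * (sic["s"] - mic["s"])
    if denom <= 0:
        return None  # n >= 50: no positive break-even, SIC is always cheaper
    return mic["d"] / denom


print(break_even_items(10))  # 1.775
print(operating_cost("MIC", 2, 10), operating_cost("SIC", 2, 10))
print(break_even_items(50))  # None
```

For a Linked Pedigree of ten parts, the break-even lies between one and two items, so under these estimates the MIC is already cheaper for two items.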


Storage Overhead. When deploying a Smart Contract, the Smart Contract’s
code is stored on the distributed ledger. Every network participant with a full
node²³ therefore stores a copy of that Smart Contract’s code.

To formally compare the storage overhead of a MIC approach and a SIC
approach, let $S_a$ denote the storage space needed per Smart Contract for
approach $a$. Let $u$ denote the storage space needed per URI (item plus Linked
Pedigree part), which is independent of the approach taken. Let further $h$
denote the storage space needed per hash, which is also independent of the
approach taken. Then for an approach $a$, the following storage overhead
function applies for one network participant:

$$o_a(\mathcal{C}, \mathcal{I}, \mathcal{P}^n) = S_a \times |\mathcal{C}| + u \times (|\mathcal{I}| + |\mathcal{P}^n|) + h \times |\mathcal{P}^n|, \quad a \in \{\mathrm{MIC}, \mathrm{SIC}\}.$$

When omitting approach-independent variables, one network participant’s
storage space overhead function for deployed Smart Contracts remains

$$o_a(\mathcal{C}) = S_a \times |\mathcal{C}|, \quad a \in \{\mathrm{MIC}, \mathrm{SIC}\}.$$

For our (admittedly simple) implementation, the Smart Contracts have the
following sizes (in bytes):

$$S_{\mathrm{MIC}} = 3{,}300; \quad S_{\mathrm{SIC}} = 2{,}300$$

For our implementation, a MIC approach is thus superior to a SIC approach
as soon as there are two supply chain items, since
$2 \times 2{,}300 = 4{,}600 > 3{,}300$.
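
The comparison of contract storage can be checked with a short calculation. The sketch below only covers the approach-dependent term, i.e. the code size per deployed Smart Contract; the function name is an assumption made for illustration.

```python
# Code size per deployed Smart Contract in bytes, as given in the text.
CONTRACT_BYTES = {"MIC": 3_300, "SIC": 2_300}


def contract_storage(approach, items):
    """Bytes of contract code one full node stores for `items` items."""
    contracts = 1 if approach == "MIC" else items
    return CONTRACT_BYTES[approach] * contracts


for items in (1, 2, 1000):
    print(items, contract_storage("MIC", items), contract_storage("SIC", items))
```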


22 Estimates for URIs of 100 characters length. 
23 As opposed to a light node, which only stores a flat copy, i.e. hash values, of the 
distributed ledger. 


7 Discussion


Applying the cost model on operating cost and storage overhead, we show that
using a network-wide MIC is economically preferable to using SICs if we
consider networks with an average supply chain length below 50. Above 50, the
cost of an additional SIC, including deployment and Linked Pedigree storing, is
smaller than the overall cost of storing an additional Linked Pedigree in a MIC.
Note that the length of 50 is the number of hops of an item between supply
chain participants. In contrast, the network’s size, i.e. the number of participants
in general, does not affect our model.

There may be special use cases where deploying multiple Smart Contract
instances is desirable in general, e.g. when the Smart Contracts are required
to interact with each other or when functionality does not fit all participants’
needs. In this respect, a SIC-based deployment seems to be more flexible than a
MIC-based one. However, updates can also be performed in a MIC-based deployment,
yet they then need to appeal to the entire user base.

It is then in any case the obligation of the business partners to agree on 
which Smart Contract instance to use. Especially in large supply chains, this 
might cause significant overhead cost. Therefore, proposing a standard Smart 
Contract that is already deployed and ready for usage might facilitate doing
business in the network.


8 Summary and Conclusion 


We presented an approach to verify the integrity of hyperlinked information
using Linked Data and Smart Contracts, where Linked Data is used to store
data off the chain. We showed a protocol for the verification in the presence of
the transfer of physical goods, outlined the technical aspects of our approach,
and evaluated our approach using a cost model we developed. The implementation
of our approach can be found online²⁴.

We see wide application possibilities of our approach in decentrally organised
logistics networks with many small participants who desire on-premise
data storage and access control. As our presented approach allows for verifying
information published as Linked Data, we contribute to the often neglected trust
layer of the Semantic Web stack [9].